
Memeta: A Framework for Multi-Relational Analytics on the Blogosphere

Pranam Kolari and Tim Finin
University of Maryland, Baltimore County
Baltimore, MD 21250
{kolari1, finin}@cs.umbc.edu
http://ebiquity.umbc.edu/project/memeta

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

The “memeta” project is developing a framework for studying the structure and content of the blogosphere. We are particularly interested in how metadata about blogs can be discovered, extracted and computed, and how this metadata can be modeled, represented and analyzed to provide new blog related services.

Weblogs, or blogs, are web sites consisting of dated entries (posts) typically organized in reverse chronological order on a single web page. They have become an important new way to publish information, engage in discussions and form web communities. Due to their high influence and specialized publishing infrastructure, the subset of the web they constitute is popularly known as the blogosphere.

The dialogue oriented nature of the blogosphere has manifested in the use of Comments and Trackbacks as mechanisms for managing conversations. Blogs are also highly dynamic and embedded in a community of both casual and regular readers, factors that have led to the adoption of new content representation models supported by metadata in several formats. Blogs are considered “killer applications” for the RSS (RDF Site Summary) and syndication formats, which recommend an XML-based representation for machine consumption, representing information like title, date, summary, etc. for content (e.g., a blog post). This metadata is used by aggregators and search engines to provide new blog-related services.

The popularity of blog aggregation has in turn encouraged the publishing of explicit author name and sometimes complete profile information using the Dublin Core and Friend-Of-A-Friend metadata vocabularies. In addition, blogs truly exploit the use of tags and the folksonomies they create. Tags are explicit metadata labels associated with a post to provide a highly subjective topic description, and have been found useful for search, browsing and “buzz” analysis. Blogs have also popularized the concept of referrals. Bloggers often publish frequently visited and followed blogs (known as blog subscriptions or blogrolls) in well structured formats like OPML (Outline Processor Markup Language), creating a referral chain that enables interesting topical and community oriented web navigation capabilities.

Motivated by the requirements from the blogosphere, multiple parallel and sometimes conflicting metadata efforts exist and are being used, including those from the Semantic Web, Microformats and Structured Blogging initiatives. Independent of which effort or combination of efforts will finally prevail, the blogosphere has driven, and will continue to drive, the publishing of richer metadata on the web. The specific nature of this content demands new techniques for knowledge management and also motivates most of our work.

(Figure 1 diagram. Panels: Blogosphere Discovery; Metadata Extraction and Computation. Labeled components: Blog Directories, Search Engines, Ping Servers, Blog Crawler, Language Identifier, Blog Identifier, Blog Metadata Analyzer, Blog Content Analyzer, Blog Post Metadata and Content Extraction, Vocabularies (RDF/XML/HTML), Blog Index.)

Figure 1: Memeta discovers blogs, analyzes them and their posts, extracts and computes metadata, and adds information to its database.

Discovery Framework

While conventional web search engines use crawlers to discover new content, the blogosphere has evolved a different approach. The community oriented nature of the blogosphere has made the concept of blog directory listings popular, whereas the time sensitive nature of content has created the need for intermediary content brokers known as Update Ping Servers. These servers accept explicit pings (i.e., notices of new posts) from newly updated (and created) blogs, and route such notifications to aggregators and blog search engines. We have developed a framework, as shown in Figure 1, that incorporates these additional traits of the blogosphere, combining them with conventional topical crawling mechanisms to create a representative sampling of the blogosphere. The crawlers we employ make use of simple heuristics and Support Vector Machines (Boser, Guyon, & Vapnik 1992) based models for blog identification (Kolari, Finin, & Joshi 2006). Our language detection module employs a cosine similarity based measure to compare content against known language models. Given the computational resource requirements of analyzing the blogosphere as a whole, we fetch metadata and content from blogs on a need-to-use basis. Our system has so far collected and analyzed over ten million blogs.

Feature    Precision  Recall  F1
words      .887       .864    .875
urls       .804       .827    .815
anchors    .854       .807    .830

Table 1: Memeta performs well at splog identification using SVMs with 19000 word features and 10000 each of URL and anchor text features ranked using Mutual Information.
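The Table 1 caption says features were ranked using Mutual Information, and the splog detection discussion describes outgoing URLs as tokenized on ‘/’. As a rough illustration only, the following sketch scores binary (present/absent) features by their mutual information with a splog/non-splog label. It is not the authors' code: the function names, the binary feature encoding, and the exact estimator are assumptions, since the paper does not specify them.

```python
import math
from collections import Counter


def url_tokens(url):
    """Split a URL on '/' and drop empty pieces (hypothetical tokenizer)."""
    return {t for t in url.lower().split("/") if t}


def mutual_information(docs, labels):
    """Score each binary feature by its mutual information with the label.

    docs   -- list of feature sets (only presence/absence is used)
    labels -- list of class labels, e.g. 1 = splog, 0 = authentic blog
    Returns a dict mapping feature -> MI in bits.
    """
    n = len(docs)
    class_count = Counter(labels)
    feat_count = Counter()   # number of documents containing the feature
    joint = Counter()        # (feature, label) co-occurrence counts
    for feats, y in zip(docs, labels):
        for f in feats:
            feat_count[f] += 1
            joint[(f, y)] += 1

    scores = {}
    for f, n_f in feat_count.items():
        mi = 0.0
        for present in (True, False):
            n_x = n_f if present else n - n_f
            for y, n_y in class_count.items():
                n_xy = joint[(f, y)] if present else n_y - joint[(f, y)]
                if n_xy == 0:
                    continue  # an empty cell contributes nothing
                mi += (n_xy / n) * math.log2(n * n_xy / (n_x * n_y))
        scores[f] = mi
    return scores


def top_features(docs, labels, k):
    """Return the k highest-scoring features (ties broken alphabetically)."""
    scores = mutual_information(docs, labels)
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]
```

With a labeled sample, `top_features(docs, labels, k)` would give the vocabulary to keep before training a classifier; the reported feature counts (19000 words, 10000 URLs, 10000 anchors) would correspond to different values of `k` per feature type.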

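The Discovery Framework section mentions a language detection module that compares content against known language models using a cosine similarity based measure. A minimal sketch of that idea follows. The choice of character trigram counts as the language model, the toy training sentences, and all function names are assumptions made for illustration; the paper does not describe the actual models or features used.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_language(text, models):
    """Return the key of the language model most similar to the text."""
    profile = char_ngrams(text)
    return max(models, key=lambda lang: cosine(profile, models[lang]))


# Toy language models built from one sentence each (illustrative only;
# realistic models would be trained on much larger corpora).
TOY_MODELS = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "es": char_ngrams("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
```

Calling `identify_language(post_text, TOY_MODELS)` returns the label of the closest model. Cosine similarity is convenient here because it normalizes for document length, so short posts and long posts can be compared against the same models.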
Modeling and Analysis Framework

Figure 2 shows the multi-relational nature of content extracted through the Memeta framework. This is unlike the conventional web graph model that treats a single entity, a webpage, as a node and models hyperlinks as edges, creating a simple relational model. This relational model is then used in many interesting applications, including the now popular social ranking of pages used by search engines. The nature of the blogosphere makes identifying and extracting multiple entities and their inter-relationships possible. This has prompted us to tackle the problem of link-based object classification and link prediction (Getoor & Diehl 2005) in a multi-relational setting. We are developing representation models for the blogosphere to support specialized multi-relational mining algorithms.

(Figure 2 diagram, titled “Multi-Relational Content Model and Representation”. Entity labels: Blog, Non-Blog, Post, Comment, Trackback, Author, MSM. Attribute and relation labels: Blogrolls, Title, Content, Time, Tag, Author. Representation labels: RDF MODEL, RELATIONAL MODEL, GRAPH MODEL.)

Figure 2: The blogosphere materializes many relations, making a multi-relational data model appropriate.

Preliminary Work on Splog Detection

Though the blogosphere offers new services providing improved user experience, it is also prone to spam, similar to that on the web in general (Gyöngyi & Garcia-Molina 2005). Referring to the multi-relational model shown in Figure 2, three different entities, namely comments, trackbacks and posts (and the blog as a whole), are usually spammed. We have so far restricted our effort to spam blogs (splogs), which constitute around 20% of the blogosphere according to independent reports and our own analysis. Specific traits of the blogosphere make the splog problem particularly challenging. First, blog search engines do not have knowledge of the entire web graph. Second, faster splog assessment is needed to combat the ease of splog creation in the existing infrastructure. Finally, since blog results are ranked mainly by freshness and relevance, most splogs usually do not have incoming links, rendering “link-only” web spam detection ineffective.

Results (measured by precision, recall and F1) from our preliminary work on splog detection (Kolari, Finin, & Joshi 2006) using different features are shown in Table 1. The bag-of-words based features slightly outperform bag-of-outgoing-urls (URLs tokenized on ‘/’) and bag-of-outgoing-anchors. Demonstrations based on analyzing splogs in ping servers are available online¹.

Continuing Work

The fairly good performance of URL and anchor-based features provides interesting insights into future approaches to the splog detection problem. We are currently experimenting with collective and iterative classification in a multi-relational setting.

Related concrete problems we are addressing include recognizing comment spam and trackbacks; recommending blogs to people; modeling trust relationships in blog communities; and spotting trends. These problems are grounded in the area of content mining and link prediction.

Acknowledgments

We thank Anupam Joshi, Tim Oates and Akshay Java for constructive discussions and James Mayfield for advice and code for language identification.

References

Boser, B. E.; Guyon, I. M.; and Vapnik, V. N. 1992. A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory, 144–152. New York, NY, USA: ACM Press.

Getoor, L., and Diehl, C. 2005. Link mining: A survey. SIGKDD Explorations 7(107).

Gyöngyi, Z., and Garcia-Molina, H. 2005. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web.

Kolari, P.; Finin, T.; and Joshi, A. 2006. SVMs for the Blogosphere: Blog Identification and Splog Detection. In AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, Stanford, March 2006. To Appear.

¹ See http://ebiquity.umbc.edu/blogger/?p=429.