
Blogosphere: Research Issues, Tools, and Applications

Nitin Agarwal Huan Liu Computer Science and Engineering Department Arizona State University Tempe, AZ 85287 {Nitin.Agarwal.2, Huan.Liu}@asu.edu

ABSTRACT
Weblogs, or blogs, have enabled people to express their thoughts, voice their opinions, and share their experiences and ideas. Individuals experience a sense of community, a feeling of belonging, and a bond: members matter to one another, and their niche needs will be met through online interactions. The open standards and low barrier to publication of blogs have transformed information consumers into producers. This has created a plethora of open-source intelligence, or “collective wisdom”, that acts as a storehouse of overwhelming amounts of knowledge about the members, their environment, and the symbiosis between them. Nonetheless, vast amounts of this knowledge still remain to be discovered and suitably exploited. In this paper, we introduce various state-of-the-art research issues, review some key elements of research such as tools and methodologies in the Blogosphere, and present a case study of identifying the influential bloggers in a community to exemplify the integration of some major aspects discussed in this paper. Towards the end, we also compare and contrast the blogosphere and social networks and the research therein.

1. INTRODUCTION TO BLOGOSPHERE
Weblogs, or blogs, are becoming one of the most popular media of communication and interaction among the masses. A blog can be defined as a website that displays, in reverse chronological order, the entries by one or more individuals and usually has links to comments on specific postings. Each of these entries is called a blog post. A typical blog post can combine text, images, and links to other blogs, web pages, and other media related to its topic. Some blog posts provide a list of links to similar or related blog posts. Such a list of links is called a blogroll. The ability of readers to leave comments in an interactive environment is an important part of blogging. People express their opinions, ideas, experiences, thoughts, and wishes through these free-form writings. The individuals who author the blog posts are referred to as bloggers. The websites that publish these blog posts are termed blog sites or blogs. Blog sites often provide opinions, commentaries, or news on a particular subject, such as food or politics; some function more like personal online diaries. The universe of all these blog sites is often referred to as the Blogosphere.

There has been a tremendous increase in user-generated content in the past couple of years via the phenomenon of blogging. Acknowledging this fact, Time named “You” as its Person of the Year for 2006. This has created a considerable shift in the way information is assimilated by individuals. This paradigm shift can be attributed to the low barrier to publication and open standards of content generation services like blogs, wikis, collaborative annotation, etc. These services have allowed the masses to contribute and edit articles publicly. Giving the masses access to contribute or edit has also increased collaboration among people, unlike previously, when there was no collaboration because access to the content was limited to a chosen few. Increased collaboration has developed collective wisdom on the Internet. Dan Gillmor [21] describes this phenomenon as a world in which “the former audience”, not a few people in the back room, now decides what is important. The “former” consumer of the information becomes the new producer, transforming the lecture style of information consumption into conversation-based assimilation.

Blogs have also made it easy for content generators to author content independent of the technical challenges of internet languages and scripts. Bloggers do not need to worry about low-level programming details; they can focus only on the content. This simplifies the content generation process to a great extent and attracts novices, and even people with little computer literacy, to participate in blogging activities. Blogs provide a platform where anyone can express himself or herself freely, unrestrained by limited computer knowledge, while still being able to publish content online. Publishing on the Internet also lets readers comment instantly, giving bloggers a feeling of satisfaction.

For many years, psychologists, anthropologists, and behavioral scientists have studied the societal capabilities of humans. They present studies and results that substantiate the fact that humans like engaging in complex social relationships and yearn to be a part of social groups. People form communities and groups for the same reason: to quench the thirst for social interaction. Often these groups have like-minded members with similar interests who discuss various issues including politics, economics, technology, lifestyle, entertainment, and so on. These discussions could be between two members of the group or involve several members.

The Internet has virtually reduced the distance between any two points on Earth to zero. It has made it possible for people to connect with each other beyond all geographical barriers. Blogs, on top of this, have tremendously affected social interactions between people and communities. People not only

SIGKDD Explorations Volume 10, Issue 1 Page 18

participate in regional matters but also in international issues. They can connect to people sitting on exactly the other side of the globe and discuss whatever they like, i.e., a flat world [18]. Communities can be spread across several time zones. This humongous mesh of social interactions is termed a social network. Blogs can be considered a type of social network that encompasses interactions between different people, members of a community, or members across different communities. Each person in a social network is represented as a node, and the relationships between people represent the links or edges among these nodes. The Blogosphere comprises several focused groups or communities that can be treated as subgraphs. These communities are highly dynamic in nature, which has led researchers to study their structural and temporal characteristics.

There are myriad services offered under the umbrella of social networks along with blogs. Other services include social friendship networks like Friendster¹ and Facebook²; collaborative annotation services like del.icio.us³ and StumbleUpon⁴; media sharing services like flickr⁵ and YouTube⁶; and wikis⁷. All these services offer a fertile ground for research. In this paper we focus on the blogosphere.

¹ http://www.friendster.com/
² http://www.facebook.com/
³ http://del.icio.us/
⁴ http://www.stumbleupon.com/
⁵ http://www.flickr.com/
⁶ http://www.youtube.com/
⁷ http://www.wikipedia.org/

The popularity and widespread use of blogs can be attributed to the changes brought by Web 2.0 in the way users interact with the web. Blogs have been around for quite some time, but they became unprecedentedly popular with the advent of Web 2.0. Although Web 2.0 may not be a technological shift, it changed the way people interact through the Internet. People could not only consume information on the Internet but also contribute to it. Easier, more intuitive interfaces with a desktop-like experience enticed users to stay connected and contribute their knowledge in the form of blog posts, wiki articles, etc. Wikis are an excellent example of Web 2.0: they are slowly taking over online encyclopedias due to the sheer breadth of knowledge made possible by mass collaboration. Since more and more people are trying to be a part of Web 2.0, it has generated enormous amounts of information on the web, which is also known as collective wisdom or open source intelligence. The basic differences between Web 1.0 (or, the way the Web was accessed previously) and Web 2.0 can be listed as follows:

• Former information consumers are now also producers. Web 2.0 has allowed the masses to contribute and edit articles through wikis and blogs.

• Giving the masses access to contribute or edit has also increased collaboration among people, unlike Web 1.0, where there was no collaboration because access to the content was limited to a chosen few.

• Increased collaboration has generated enormous open source intelligence or collective wisdom on the Internet, which was not there in Web 1.0.

Blog sites can be categorized into individual blog sites, or single-authored blog sites, and community blog sites, or multi-authored blog sites. Individual blog sites are the ones owned and maintained by an individual. Examples of individual blogs include Sifry’s Alerts: David Sifry’s musings (Founder & CEO, Technorati)⁸, Ratcliffe Blog–Mitch’s Open Notebook⁹, and The Webquarters¹⁰. On the other hand, community blog sites are owned and maintained by a group of like-minded users. Examples of community blogs include Google’s Official Blog site¹¹, The Unofficial Apple Weblog¹², Engadget¹³, and Boing Boing: A Directory of Wonderful Things¹⁴. We summarize the differences between individual and community blogs in Table 1.

Table 1: Comparing Individual and Community Blog Sites.

⁸ http://www.sifry.com/alerts/
⁹ http://www.ratcliffeblog.com/
¹⁰ http://webquarters.blogspot.com/
¹¹ http://googleblog.blogspot.com/
¹² http://www.tuaw.com/
¹³ http://www.engadget.com/
¹⁴ http://boingboing.net/

An interactive information delivery medium like blogs offers conducive ground for virtual communities, that is, communities that originate over the Internet. There has been a lot of ongoing research to mine knowledge in the Blogosphere. This survey is organized as follows: Section 2 introduces various issues pertinent to the blogosphere. Section 3 reviews tools, general methodologies, datasets, and performance metrics that are useful for conducting research in the Blogosphere. Section 4 presents a case study. Section 5 discusses the connection between the Blogosphere and state-of-the-art social networks. Section 6 concludes the paper with some possible future directions for research in the blogosphere.

2. RESEARCH ISSUES
Here we study various research issues and challenges with potential applications. We discuss the research issues in terms of modeling, clustering, mining, community discovery and factorization, influence and propagation, trust and reputation, and filtering.

2.1 Modeling the Blogosphere
The first and foremost challenge lies in developing an appropriate model for the blogosphere. Researchers and practitioners often ask which model best describes the structure and properties of the blogosphere. Such a model can help in gaining deeper insights into the relationships between bloggers, commenters, blog posts, comments, viewers/readers, and different blog sites in the blogosphere. This

can help us in understanding and defining various concepts of the blogosphere at an abstract level. These types of models would also help in tackling several other challenges of the blogosphere. A model for the blogosphere would be useful in generating an artificial dataset, tuning the parameters to simulate a special scenario, and comparing different algorithms and studies. Such a model will also help in studying peculiarities in the blogosphere and inferring latent patterns and structures that could explain certain phenomena like community discovery, spam blogs, information diffusion and influence, etc., to be discussed later in this section.

Modeling the blogosphere is often associated with modeling the web. Researchers represent the web as a webgraph, where each webpage forms a node and the hyperlinks between webpages form the edges. This kind of representation results in a directed cyclic graph. Weights can be associated with these edges. Such a graph model of the web is extensively exploited. One prominent example is the search engine domain, which relies on this graph-based model of the web to rank webpages [10; 32]. Although the web models seem to be an appropriate choice for modeling the blogosphere, certain key differences prevent reusing them in the blogosphere domain. First, models developed for the web assume a dense graph structure due to a large number of interconnecting hyperlinks within webpages. This assumption does not hold true in the blogosphere, since the hyperlink structure in the blogosphere is very sparse, as shown in [35]. Second, the level of interaction in terms of comments and replies to a blog post makes the blogosphere different from the web. Third, the highly dynamic and “short-lived” nature of blog posts cannot be simulated by the web models. Web models do not consider this dynamicity in web pages. They assume web pages accumulate links over time. However, in a blog network, where blog posts are the nodes, it is impractical to construct a static graph like the one for the web. These differences call for a model tailored to the characteristics of the blogosphere.

There are several models for the web, like the random graph [49], preferential attachment graph [6], hybrid graph [44], and random walk on graph [9]. A random graph constructs edges between each pair of nodes with some probability, which fails to exhibit the power law degree distribution or scale-free graph structure. For this reason random graph models cannot be used to model the blogosphere. Preferential attachment graph models follow the phenomenon of “the rich get richer”, where the probability of a new edge being added to a node is based on its degree. The higher the degree of a node, the better the chances are for a different node to be connected with this node. These models exhibit the power law distribution. Hybrid graph models are basically a mixture of both random graphs and preferential attachment models, so as to give a lucky “poor” node a chance to get “rich”. The blogosphere can be modeled using this model with some modifications. To solve the problem of irreducibility (strong connectedness with few isolated subgraphs), the random walk on a graph model proposes a random jump with a fixed probability between 0.8 and 0.9 in addition to the preferential attachment model.

The above models have been used to model the blogosphere with modifications, but these models could not explain the blogosphere precisely. This has motivated researchers to come up with models specific to the blogosphere. Leskovec et al. [38] studied the temporal patterns of the blogosphere, like how often people create blog posts, burstiness and popularity, how these blog posts are linked, and what the link density is. They reported that these phenomena follow power law distributions. Based on their findings, they developed a cascade model similar to the SIS (susceptible-infected-susceptible) model from epidemiology. In this model any randomly picked blog can infect its uninfected immediate neighbors probabilistically, and these neighbors repeat the same process until no node remains uninfected. In the end, this gives a blog network. Kumar et al. [37] use the blogrolls given on a blog post to create a network of connected posts, with the underlying assumption that blogrolls have links to related or similar blog posts. A lot of research has been conducted that posits a known network structure of the blogosphere to model the problem domain. Such models are specific to problem domains and are discussed next in reference to those problem domains.

2.2 Blog Clustering
The Blogosphere is a storehouse of several publicly regulated media. Technorati¹⁵ reported that 175,000 blog posts were created daily, which is about 2 blog posts per second. This explosive growth makes it beyond human capabilities to look for interesting and relevant blog posts. Therefore a lot of research is going on to automatically cluster different blogs into meaningful groups, such that readers can focus on interesting categories rather than filtering out relevant blogs from the jungle. Often blog sites allow their users to provide tags for the blog posts. This human-labeled tag information forms the so-called “folksonomy”. Brooks and Montanez [11] presented a study in which the human-labeled tags were good for classifying blog posts into broad categories but less effective in indicating the particular content of a blog post. They used the tf-idf measure to pick the three highest-scoring words in every blog post, computed the pairwise similarity among all the blog posts, and clustered them. They compared the results with the clustering obtained using the human-labeled tags and reported significant improvement. In other research [39], the authors tried to cluster blog posts by assigning different weights to the title, body, and comments of a blog post. However, these approaches rely on keyword-based clustering, which suffers from high dimensionality and sparsity. Agarwal et al. [2] proposed WisClus, which uses the collective wisdom of the bloggers to cluster the blogs. They use the blog categories to construct a category relation graph, merge different categories, and cluster the blogs that belong to these categories. The nodes of the category relation graph are the categories, and its edges represent the similarity between them. The similarity between two categories is computed using the number of blogs that simultaneously use these categories as their blog labels. Experiments show that the collective wisdom based clustering performs better than keyword-based clustering, even after reducing the dimensionality and sparsity by mapping to the concept space using Latent Semantic Indexing (LSI) [14]. Clustering different blog posts would also help blog search engines like Technorati to narrow down the search space once the query context is clear. Websites like Blogcatalog¹⁶ organize blogs into a taxonomy that helps in focused browsing of blogs.

¹⁵ http://www.technorati.com/
¹⁶ http://www.blogcatalog.com/directory
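The category relation graph used in collective wisdom based clustering can be sketched in a few lines. This is only an illustration under stated assumptions, not the WisClus implementation from [2]: the function names, the toy blog labels, and the `min_shared` merge threshold are hypothetical, and the similarity between two categories is simply taken to be the raw count of blogs labeled with both.

```python
from collections import Counter
from itertools import combinations


def category_relation_graph(blog_labels):
    """Edge weight = number of blogs labeled with both categories (assumed similarity)."""
    edges = Counter()
    for labels in blog_labels.values():
        # Sort so each unordered pair is always stored under the same key.
        for a, b in combinations(sorted(set(labels)), 2):
            edges[(a, b)] += 1
    return edges


def merge_categories(blog_labels, min_shared=2):
    """Merge categories joined by an edge of weight >= min_shared (hypothetical threshold)."""
    parent = {c: c for labels in blog_labels.values() for c in labels}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), weight in category_relation_graph(blog_labels).items():
        if weight >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: blog name -> categories chosen by its author.
blogs = {
    "blog_a": ["gadgets", "technology"],
    "blog_b": ["gadgets", "technology", "politics"],
    "blog_c": ["politics", "elections"],
    "blog_d": ["elections", "politics"],
}
clusters = merge_categories(blogs)
print(clusters)
```

With this toy data, "gadgets" and "technology" are co-used by two blogs, as are "elections" and "politics", so each pair is merged into one group; blogs can then be assigned to the merged group that contains their labels. A real system would likely normalize the counts and choose the threshold empirically.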

2.3 Blog Mining
Blog mining as a technique is evolving and taking the form of qualitative research. Companies are using blogs as qualitative research tools. Historically, the interaction between marketers and consumers has been a closed loop. Marketers used to send out messages to consumers and sought their feedback through traditional research. Now, consumers can not only speak their mind but also broadcast their opinions. The surge of marketing messages, combined with low consumer trust, has led to people relying on one another’s opinions to make informed decisions, prompting conversations between them. These interactions are found on blogs and have attracted the attention of several companies. Blogs are immensely valuable resources to track consumers’ beliefs and opinions, gauge the initial reaction to a launch, understand consumer language, track trends and buzzwords, and fine-tune information needs. Blog conversations leave behind trails of links, useful for understanding how information flows and how opinions are shaped and influenced. Tracking blogs also helps in gaining deeper insights, as bloggers share their views from various perspectives, hence giving a “context” to the information collected.

Mining sentiments from free text poses several challenges as compared to traditional feedback and surveys. A prototype system called Pulse [19] uses a Naïve Bayes classifier trained on sentences manually annotated with positive/negative sentiments and iterates until all unlabeled data is adequately classified. Another system presented in [5] improves blog retrieval by using opinionated words acquired from WordNet in the query proximity. Some well-known opinion mining techniques [41] could also be borrowed, given the highly textual nature of blogs.

2.4 Community Discovery and Factorization
Another important line of research, which branched out from blog-site clustering, is determining and inferring communities. Several studies looked into identifying communities in the Blogosphere. One method that researchers commonly use is content analysis and text analysis of the blog posts to identify communities in the blogosphere [7], [16], [37]. Kleinberg [32] used an alternative approach to identify communities on the web using a hub and authority based approach, clustering all the expert communities together by identifying them as authorities. Kumar et al. [36] extended the idea of hubs and authorities and included co-citations as a way to extract all communities on the web, and used graph theoretic algorithms to identify all instances of graph structures that reflect community characteristics. Chin and Chignell [12] proposed a model for finding communities that takes the blogging behavior of bloggers into account, aligning behavioral approaches to studying community with network and link analysis approaches. They used a case study to first calibrate a measure for evaluating a community based on behavioral aspects using a behavioral survey; this measure could be generalized later on, removing the need for such surveys.

Several researchers have also studied community extraction and social network formation using newsgroups and discussion boards. Although these differ from the blogosphere, we include them here because discussion boards and newsgroups are very similar to blogs in the sense that they do not have an explicit link structure, and the communication is not “person-to-person” but rather “person-to-group”. Blanchard and Markus [8] studied a “virtual settlement”, the Multiple Sport Newsgroup, and analyzed the possibility of virtual communities emerging in it. They studied the characteristics of the newsgroup by conducting interviews with three different kinds of members: leaders (active and well respected), participants (occasionally active around events like triathlons), and lurkers (readers only). They reported that different virtual communities emerge between athletes and those who join the community to keep themselves informed of the latest developments.

2.5 Influence in Blogs and Propagation
As communities evolve over time, so do the bellwethers or leaders of the communities, who possess the power to influence the mainstream. According to the studies in [30], 83% of people prefer consulting family, friends, or an expert over traditional advertising before trying a new restaurant, 71% prefer to do so before buying a prescription drug or visiting a place, and 61% prefer to do so before watching a movie. This style of marketing is known as “word-of-mouth”. “Word-of-mouth” has been found to be more effective than traditional advertising in physical communities. Studies from [30] show that before people buy, they talk, and they listen. Experts can influence the decisions of people. For this reason these experts are aptly termed the Influentials. Influential bloggers tend to submit influential blog posts that affect other members’ decisions and opinions. They accrue respect in the community over time. Other members tend to listen to what the influentials say before making decisions.

Identification of these influential bloggers [4] could lead to several interesting applications. The influentials are potential market-movers. Since they can influence the buying decisions of the mainstream, companies can promote them as latent brand ambassadors for their products. Being such a highly interactive medium, blogs tend to host vivid discussions on various issues including new products, services, marketing strategies, and their comparative studies. Often this discussion also acts as “word-of-mouth” advertising of several products and services. A lot of advertising companies, approximately 64% [17], have acknowledged this fact and are shifting their focus towards blog advertising and identifying these influentials.

The influentials could sway opinions in political campaigns, elections, and reactions to government policies [15]. Because they know many people and soak up a large amount of information, the influentials stand out as knowledgeable, informed sources of advice and insight. Approximately 84% of the influentials in physical communities are interested in politics and are sought out by others for their perspectives on politics and government, 55% on a regular basis.

The influentials could help in customer support and troubleshooting. A lot of companies these days host their own customer blogs, where people can discuss issues related to a product. Often the influentials on these blogs troubleshoot the problems peer consumers are having, and their advice is trusted because of the sense of authority these influentials possess. Often the influentials offer suggestions to improve the products. These invaluable comments could be really helpful for companies and customers. Instead of going through each member’s blog posts, companies can focus

on the influentials’ blog posts. For instance, Macromedia¹⁷ aggregates, categorizes, and searches the blog posts of 500 people who write about Macromedia’s technology.

Some recent numbers from Technorati show a 100% increase in the size of the blogosphere every six months. It has grown over 60 times during the past three years. Approximately 2 new blog posts appear every second¹⁸. With new blog posts being generated at such a blazing rate, it is impossible to keep track of what is going on in the blogosphere. Many blog readers/subscribers just want to know the most insightful and authoritative stories before delving into the discussions. Blog posts from influential bloggers would serve exactly this purpose by standing out as representative articles of a blog site. The influentials can be the showcases of a group on the blogosphere.

¹⁷ http://weblogs.macromedia.com/
¹⁸ http://www.sifry.com/alerts/archives/000436.html

These interesting applications have attracted a surge of research in identifying influential blog sites as well as influential bloggers. Some try to find influential blog sites in the entire blogosphere and study how they influence the external world and the blogosphere itself [20]. The problem of ranking blog sites or bloggers differs from that of finding authoritative webpages. As pointed out in [35], blog sites in the blogosphere are very sparsely linked, and it is not suitable to rank blog sites using Web ranking algorithms like PageRank [43] and HITS [32]. The Random Surfer model of webpage ranking algorithms [43] does not work well for sparsely linked structures. The temporal aspect is most significant in the blog domain. While a webpage may acquire authority over time (its adjacency matrix gets denser), a blog post’s or a blogger’s influence diminishes over time. Consequently, the adjacency matrix of blogs (considered as a graph) will get sparser as thousands of new sparsely-linked blog posts appear every day.

Some recent work [35] suggests adding implicit links, based on topics, to increase the density of link information. If two blogs are talking about the same topic, an edge can be added between these two blogs based on the topic similarity or information epidemics. However, constructing links based on topic models still remains an area of research. A similar strategy adopted by Adar et al. [1] is to consider the implicit link structure of blog posts. In their iRank algorithm, a classifier is built to predict whether or not two blogs should be linked. The objective in this work is to find the path of infection (how one piece of information is propagated). iRank tries to find the blogs that initiate the epidemics. Note that an initiator might not be an influential, as it might affect only a limited number of blogs. Influentials should be those that play a key role in the information epidemics.

Gruhl et al. [24] study information diffusion of various topics in the blogosphere between different blog sites, drawing on the theory of infectious diseases. A general cascade model [23] is adopted. They derived their model from the independent cascade model and generalized it to the general cascade model by relaxing the independence assumption. They associate a ‘read’ probability and a ‘copy’ probability with each edge of the blog graph, indicating the tendency of a blog to be read and copied, respectively. They also parameterize the stickiness of a topic, which is analogous to the virulence of a disease. An interesting problem related to viral marketing [46; 31] is how to maximize the total influence among the nodes (blog sites) by selecting a fixed number of nodes in the network. A greedy approach can be adopted to select the most influential node in each iteration after removing the already selected nodes. This greedy approach outperforms PageRank, HITS, and ranking by number of citations, and is robust in filtering splogs (spam blogs) [28].

Finding influential blog sites is orthogonal to the problem of identifying influential bloggers. Given the nature of the blogosphere, influential blog sites are few. A large number of non-influential sites belong to the long tail [3], where abundant new business, marketing, and development opportunities can be explored. Agarwal et al. [4] studied and modeled the influence of a blogger on a community blog site regardless of whether the site is influential or not. They modeled the blog site as a graph, treating different bloggers as nodes and using the inherent link structure, including inlinks and outlinks, as edges. Using the link structure, the influence flow across different bloggers is observed recursively. Other blog post level statistics, like blog post quality and comment information, were also used to achieve better results. The model used different weights to regulate the contribution of different statistics. These weights could be tuned to obtain different breeds of influential bloggers. Influential bloggers are not necessarily active bloggers at a blog site [4]. Many blog websites list top bloggers or top blog posts in some time frame (e.g., monthly). Those top lists are usually based on some traffic information (e.g., how many posts a blogger posted, or how many comments a blog post received) [20]. With the speedy growth of the blogosphere, it is increasingly difficult, if at all possible, to manually track the developments and happenings in the blogosphere, in particular at the many blog sites where bloggers enthusiastically participate in discussions, get information, inquire and seek answers, and voice their complaints and needs.

2.6 Trust and Reputation
Open standards and a low barrier to publishing have allowed anyone to submit blog posts and contribute content. On one hand, this has created an overwhelming amount of collective wisdom; on the other hand, it has made it difficult for readers to decide whom to trust or believe. This has been a great challenge since the inception of the Web, which created the problem of identifying authoritative webpages. Kleinberg [32] and Page et al. [43] tried to solve this problem by exploiting the link structure of webpages. But social networking sites, and especially the Blogosphere, allow the masses to create and edit content, compromising (risking) the sanctity of the original content. Researchers anticipated this problem in social networking and recommender systems and conducted research in those areas. However, the potential of this research is still underestimated for the blogosphere domain, and not much research is reported. Here we briefly point out the work already done in social networks to provide an insight into this problem and mention the current state of trust related research in the Blogosphere.

In social networks it is important not only to detect the influential members or experts in the case of knowledge sharing in communities, but also to assess to what extent some of the members are recognized as experts by their colleagues in the community. This leads to the estimation of the trust and reputation of these experts. Some social friendship networks

like Orkut19 allow users to assign trust ratings, implying a more explicit notion of trust. Other websites have an implicit notion of trust, where creating a link to a person on a webpage implies some amount of business trust for the person. In other cases, trust and reputation of experts could typically be assessed as a function of the quality of their response to other members’ knowledge solicitations. Pujol et al. [45] proposed a NodeMatching algorithm to compute the authority or reputation of a node based on its location in the social friendship network. A node’s authority depends upon the authority of the nodes that refer to this node and also on the authority of other nodes that this node refers to. The basic idea is to propagate the reputation of nodes in the social friendship network.

19 http://www.orkut.com

While Pujol et al. [45] proposed an approach to establish reputation based on the position of each member in the social friendship network, the authors of [50] developed a model based on the Dempster-Shafer theory of evidence in the wake of spurious testimonies provided by malicious members of the social friendship network. Each member of a social friendship network is an agent. Each agent has a set of acquaintances, a subset of which forms its neighbors. Each agent builds a model for its acquaintances to quantify their expertise and sociability. These models are dynamic and change based on the agent’s direct interactions with the given acquaintance, interactions with agents referred to by the acquaintance, and on the ratings this acquaintance received from other agents. The authors point out a significant problem with this approach, which arises if some acquaintances or other agents generate spurious ratings, exaggerate positive or negative ratings, or offer testimonies that are outright false.

Sabater and Sierra [47] propose a combination of reputation scores on three different dimensions. They combined reputation scores not only through social relations governed by a social friendship network (termed the social dimension) but also past experiences based on individual interactions (termed the individual dimension) and reputation scores based on other dimensions (termed the ontological dimension). For large social networks it is not always possible to get reputation scores based on just the individual dimension, so they can use the social dimension and ontological dimension, which would enhance the reputation estimation by considering different contexts. The ontological dimension is very similar to the work proposed in [48], where the authors recommend collaboration in social networks based on several factors. They explain the importance of context in recommending a member of a social network for collaboration.

In [22], the authors consider those social networking sites where users explicitly provide trust ratings to other members. However, for large social networks it is infeasible to assign trust ratings to each and every member, so they propose an inferring mechanism which would assign binary trust ratings (trustworthy/non-trustworthy) to those who have not been assigned one. They demonstrate the use of these trust values in a filtering application and report encouraging results. The authors also assume three crucial properties of trust for their approach to work: transitivity, asymmetry, and personalization. These trust scores are often transitive, meaning that if Alice trusts Bob and Bob trusts Charles, then Alice can trust Charles. Asymmetry says that for two people involved in a relationship, trust is not necessarily identical in both directions. This is contrary to [50], whose authors assume symmetric trust values in the social network between two members. Also, consolidating the trust scores for a member and computing a global trust score for each member might not give a reasonable estimation. Trust of a member is entirely a personal opinion. Therefore, the authors propose personalization of trust, which means that a member could have different trust values with respect to different members. Guha et al. [25] proposed another trust propagation scheme in social friendship networks based on a series of matrix operations, including the element of distrust along with the trust scores.

Although there has been a lot of work that deals with trust in social networks and recommender systems, not many have considered trust in the blogosphere. Researchers have tried to transform the blogosphere domain to the problem domain considered in trust in social networks. The authors of [29] consider a window of words around the links in a blog post to mine the sentiments about the cited blog post. Using VoteLinks, these sentiments can be classified as positive, negative or neutral sentiments. These bags of sentiments can then be used to compute the link polarity between a pair of blog posts. Using the trust propagation model of Guha et al. [25], they compute the trust in the network of blog sites in the blogosphere. There is still a lot of information unexploited in this approach, such as comments from the readers on the blog post, that can also be used to judge a blogger’s or a blog post’s trust.

2.7 Filtering Spam Blogs
Spam blogs, often called splogs, are one of the major concerns in the blogosphere. Besides degrading the quality of search results, they also waste network resources. So researchers are looking into this aspect of the blogosphere. Although it is a relatively new phenomenon, researchers have compared it with the existing work on web (link) spam detection. For web spam detection, the authors of [42] distinguish between normal webpages and spam webpages based on statistical properties such as number of words, average length of words, anchor text, title keyword frequency, and tokenized URL. Some works [26; 27] also use PageRank to compute the spam score of a webpage. Some researchers consider splogs as a special case of web spam. The authors of [33; 34] consider each blog post as a static webpage and use both content and hyperlinks to classify a blog post as spam using an SVM-based classifier. However, there are some critical differences between web spam detection and splog detection. The content on blog sites is very dynamic as compared to that of webpages, so content-based spam filters are ineffective. Moreover, spammers can copy the content from some regular blog posts to evade content-based spam filters. Link-based spam filters can easily be beaten by creating links pointing to the splogs. The authors of [40] consider the temporal dynamics of blog posts and propose a self-similarity-based splog detection algorithm built on characteristic patterns found in splogs, such as regularities or patterns in posting times of splogs, content similarity in splogs, and similar links in splogs.

3. TOOLS AND OTHER RELATED ISSUES
Having presented the status quo of the ongoing research in the Blogosphere, we now discuss available tools to analyze the domain, methodologies across different disciplines, data

collection and pre-processing, and performance metrics.

3.1 Tools and APIs
Several modeling tools are available to simulate social networks; they help study various characteristics of these networks and conduct experiments, including:

• NetLogo20: A multi-agent programming language and modeling environment designed in the Logo programming language. Modelers can give instructions to hundreds or thousands of concurrently operating autonomous “agents”. This helps in exploring the connection between the individuals (micro-level) and the patterns that emerge from the interaction of many individuals (macro-level).
  20 http://ccl.northwestern.edu/netlogo/

• StarLogo21: An extension of the Logo programming language. It is used to model the behavior of decentralized systems like social networks.
  21 http://education.mit.edu/starlogo/

• Repast22: Recursive Porous Agent Simulation Toolkit is an agent-based social network modeling toolkit. It has libraries for genetic algorithms, neural networks, etc. and allows users to dynamically access and modify agents at run time.
  22 http://repast.sourceforge.net/

• Swarm23: A multi-agent simulation package to simulate the social or biological interaction of agents and their emergent collective behavior.
  23 http://www.swarm.org/wiki/Main_Page

• UCINet24: A comprehensive package for the analysis of social network data including centrality measures, subgroup identification, role analysis, elementary graph theory, and permutation-based statistical analysis. In addition, the package has strong matrix analysis routines, such as matrix algebra and multivariate statistics.
  24 http://www.analytictech.com/

• Pajek25 (Slovenian for spider): Software for analyzing and visualizing large networks such as social networks.
  25 http://vlado.fmf.uni-lj.si/pub/networks/pajek/

• Network package in “R”26: The network class can represent a range of relational data types, and supports arbitrary vertex/edge/graph attributes. It is used to create and/or modify network objects and is used for social network analysis (SNA).
  26 http://cran.r-project.org/src/contrib/Descriptions/network.html

• InFlow27: Another integrated product for network analysis and visualization. It has been used in the SNA domain.
  27 http://www.orgnet.com/inflow3.html

• NetMiner28: A tool for exploratory network data analysis and visualization. NetMiner allows users to explore network data visually and interactively, and helps in detecting underlying patterns and structures of the network.
  28 http://www.netminer.com/

• SocNetV29: A Linux-based SNA and visualization utility. SocNetV can compute network and actor properties, such as distances, centralities, diameter, etc. Furthermore, it can create simple random networks (lattice, same degree, etc.).
  29 http://socnetv.sourceforge.net/

Besides simulation and modeling toolkits there are APIs from Facebook, StumbleUpon, Technorati, del.icio.us, Digg, etc. These APIs could be used to download real-world data and study properties of social networks and concepts such as small worlds, random networks, scale-free networks, laws and distributions (normal distribution, Zipf’s law, power law), search in networks, computation/propagation of influence and trust, diffusion (epidemics), robustness in networks, collective wisdom, collaborative filtering, social decision making, social criminals, individual profiling and privacy, story construction, provenance, and the unique characteristics of Long Tail blogs/blog sites and Short Head blogs/blog sites.

3.2 Methodologies
We now discuss the broad technical concepts that form the necessary background for conducting research in the social network domain: centrality measures, network models, content analysis, link analysis, supervised learning, the decision theoretic approach, and agent-based modeling.

Network centrality measures form an essential part of social network analysis. Social network analysis is used to identify leaders, mavens, brokers, groups, connectors (bridges between groups), mavericks, etc. Researchers have used several centrality measures to gauge the information flow across a social network, which could help in identifying the different roles of nodes mentioned above. Centrality measures help in studying the structural attributes of nodes in a network. They help in studying the structural location of a node in the network, which could decide the importance, influence or prominence of a node in the network. Centrality measures help in estimating the extent to which the network revolves around a node. Different centrality measures include Degree centrality, Closeness centrality, Betweenness centrality, and Eigenvector centrality. Degree centrality refers to the total number of connections or ties a node has in the network. This could be imagined as a “hubness” value of that node. Row or column sums of the adjacency matrix give the degree centrality of the corresponding node. Closeness centrality is defined in terms of the sum of the geodesic distances from a node to all other nodes in the network (typically as its inverse, so that a smaller total distance gives a higher centrality). This could be imagined as identifying the “nearest” node to the other nodes in the network. Betweenness centrality refers to the extent to which a node is directly connected to nodes that are not directly connected to each other, or the number of geodesic paths that pass through this node. This evaluates how well a node can act as a “bridge” or intermediary between different subgraphs. A high betweenness centrality node can become a “broker” between different subgraphs. Eigenvector centrality defines a node to be central if it is connected to those who are central. This could be gauged as the “authoritativeness” of a node. It is given by the principal eigenvector of the adjacency matrix of the network.

Other SNA measures used for analyzing social networks are clustering coefficient (the likelihood that associates of a node are associates among themselves to ensure greater cliquishness), cohesion (extent to which the actors are connected directly to each other), density (proportion of ties of a node to the total number of ties this node’s friends have), radiality (extent to which an individual’s network reaches out into the network and provides novel information), and reach (extent to which any member of a network can reach other members of the network).

Useful concepts in modeling social networks demand a lucid understanding of various network models such as scale-free, random [49], preferential attachment [6], hybrid [44], cascade models, etc. The link structure could be modeled using the scale-free power law distribution ($P(k) \propto k^{-\gamma}$). There are two generic aspects of real networks (e.g., social networks, blog networks, the World Wide Web, biological networks, etc.) that make scale-free power law models an appropriate choice as compared to random models. First, the number of nodes (N) in real networks is not static. N increases throughout the lifetime of the network and the new nodes attach to the vertices already present in the network. Second, the random network models assume that the probability that two vertices are connected is random and uniform. However, most real networks exhibit preferential connectivity. For example, a newly created webpage will be more likely to include links to well-known popular documents with already high connectivity. Thus the probability with which a new vertex connects to the existing vertices is not uniform; there is a higher probability that it will be linked to a vertex that already has a large number of connections. This property of scale-free power law models is known as preferential attachment. Some works [44] have shown the relative importance of hybrid models in simulating social networks by determining the appropriate proportion of random and scale-free networks. Information flow across the network could be studied with the help of cascade models. Information diffusion could be considered analogous to the spread of a viral disease. Models from epidemiology have been borrowed and studied to model the diffusion aspect in social networks. The key is to exploit the different properties of scale-free, random, preferential attachment, hybrid, and cascade models to efficiently and effectively model social networks.

Blogs have rich textual content. Not only do people create new content, they also enrich the existing content by providing metadata such as labels and tags. These human-generated tags are also called “folksonomies”. State-of-the-art content analysis techniques could be used for basic clustering and classification of the blog posts/blog sites. Traditional text analysis approaches like tf-idf could be used for indexing the blog entries. Folksonomies could be considered as class labels and supervised machine learning could be performed: classification models could be learned on a labeled dataset, and the learned models could be used to predict the tags of an unlabeled corpus. This forms an essential concept for semi-automatically generating “tag-clouds” with the least human intervention.

Link analysis helps in understanding several interesting phenomena of social networks. Text around the links gives us knowledge about the linked blog posts. Based on the links, hubs and authorities could be discovered. This could be achieved in exactly the same way as it is done for webpages. This approach could lead to the identification of expert communities. Several researchers have also pointed out the sparsity in the link structure of social networks, which makes it different from the World Wide Web model. Many of them, like the Blogosphere, assume implicit link information among bloggers. Links could be constructed using topic analysis. For example, blog posts talking about the same topic could be connected. Supervised learning algorithms could be used to predict topics of unlabeled blog posts, which helps achieve link construction.

Several studies have been conducted on the decision theoretic approach for group-individual interaction and the effect of a decision on an individual and/or a community as a whole. Decision theory studies what is the best possible decision to take given a fully informed decision maker. In the context of social networks this could be applied in finding the node in the network that is best placed to make decisions with the least possible side-effects and the maximum possible gains for the rest of the nodes. This is a classic subject of study in microeconomics. Some decisions are difficult to make because of the need to reach a consensus among other members and the uncertainty in the response of different individuals. The analysis of such social decisions is dealt with through game theory.

Social networks are also studied from the perspective of agent-based modeling. Basically each node in a social network can be treated as an agent. This agent could be a blogger in the blogosphere domain. Then, assuming the network follows some distribution, usually a scale-free model, we can model the decision making ability of the agent probabilistically. This can help us in studying the factors that affect his/her blogging behavior, what decisions (s)he makes and how, etc. Neural networks or genetic algorithms could also be used to train the model of these agents to closely simulate some real-world scenario, which means iteratively tuning the model parameters to keep improving the model.

3.3 Data Collection
Data collection is an essential part of studying and evaluating concepts in empirical research. Since social networking is a socio-psychological phenomenon and is more prevalent in the real world than in theoretical study, enormous amounts of data exist on actual social networking websites. Moreover, since this involves user information, it is sensitive when the data is used in open research. Much of such data is unavailable due to privacy concerns. A few available datasets are:

• Nielsen Buzzmetrics dataset: The dataset consists of about 14M blog posts from 3M blog sites collected by Nielsen BuzzMetrics30 in May 2006. The data is annotated with 1.7M blog-blog links. However, up to half of the blog outlinks are missing. Only 51% of the total blog posts are in English.
  30 http://www.nielsenbuzzmetrics.com/

• Enron Email dataset: It contains data from about 150 users, mostly senior management of Enron. The corpus31 contains a total of about 0.5M messages. People have studied the social networks between users based on link construction. Links are constructed based on email senders and recipients.
  31 http://www.cs.cmu.edu/∼enron/

• APIs: APIs provided by Facebook, Digg, StumbleUpon, Technorati, del.icio.us etc. can be used to download data from the corresponding social networking website. Nevertheless, the API usage is often restricted

to either the last 30 days or the top 100 results or, in the case of friendship networks like Facebook, one can only download data of his/her network of friends.

There is another more challenging yet appropriate option to obtain datasets. People can write crawlers and parsers to download data from blog sites. These custom datasets can be downloaded and pre-processed to serve more specific needs. We discuss more in Section 4.

3.4 Experiments and Performance Metrics
Since many concepts such as influence, trust, information propagation, and identification of information routers, brokers, etc. in a social network domain like the Blogosphere are socio-psychological and highly subjective in nature, setting up experiments and evaluating the results is non-trivial. The absence of ground truth makes it even harder to compare the different approaches available in the spectrum. Lacking ground truth, most of the existing works resort to using search engines’ ranking algorithms as the baseline, even though theoretically it has been proven that current search engines are not suited for social network data and link structure. Recent work like [4] has used another Web 2.0 application, i.e., Digg, to evaluate the influence in the blogosphere. Results and discussion are included in Section 4.

4. A CASE STUDY
Here we present a study of identifying influential bloggers in a community [4]. We discuss model development, data collection, and model tuning and verification through experiments.

4.1 Model Development
Assuming the domain to be community or multi-authored blogs, the influential bloggers are defined as: A blogger can be influential if s/he has more than one influential blog post. The model assigns an influence score to each blog post of the blogger. These blog-post-level influence scores are used to calculate the influence of the blogger.

An initial set of intuitive properties is proposed in [4] to approximately represent influential blog posts.

• Recognition - An influential blog post is recognized by many. This can be equated to the case that an influential post p is referenced in many other posts, or its number of inlinks (ι) is large. The influence of those posts that refer to p can have different impact: the more influential the referring posts are, the more influential the referred post becomes.

• Activity Generation - A blog post’s capability of generating activity can be indirectly measured by how many comments it receives, i.e., the amount of discussion it initiates. In other words, few or no comments suggest little interest of fellow bloggers, thus non-influential. Hence, a large number of comments (γ) indicates that the post affects many such that they care to write comments, and therefore, the post can be influential. There are increasing concerns over spam comments that do not add any value to the blog posts or blogger’s influence. Fighting spam is outside the scope of this work and recent research can be found in [33; 40].

• Novelty - Novel ideas exert more influence as suggested in [30]. Hence, the number of outlinks is an indicator of a post’s novelty. A large number of outlinks (θ) may suggest that a post refers to many other blog posts or articles, indicating that it is less likely to be novel. Correlation experiments in [4] have reported that the number of outlinks is negatively correlated with the number of comments, which means more outlinks reduce people’s attention.

• Eloquence - An influential is often eloquent [30]. This property is most difficult to approximate using some statistics. Given the informal nature of the blogosphere, there is no incentive for a blogger to write a lengthy piece that bores the readers. Hence, a long post often suggests some necessity of doing so. Therefore, we use the length of a post (λ) as a heuristic measure for checking if a post is influential or not. Correlation experiments in [4] have reported that the blog post length is positively correlated with the number of comments, which means longer blog posts attract people’s attention.

4.2 Data Collection
Data collection is one of the critical tasks in this work. Since there are no available blog data sets for the purposes of our experiments, we need to collect real-world data. There exist many blog sites. Some, like Google’s Official Blog site, act as a notice board for important announcements rather than for discussions or sharing opinions, ideas and thoughts; some do not provide most of the statistics needed in our work, although they can be obtained via some additional work (more explanation later). A few publicly available blog datasets like the BuzzMetric dataset32 were designed for different research experiments, so there is no way to obtain some key statistics required in this work.

32 http://www.nielsenbuzzmetrics.com/cgm.asp

Therefore, we crawled a real-world blog site that provides most of the statistics required in our experiments. The advantages of doing so include (1) minimizing our effort on figuring out ways to obtain the needed statistics, and (2) maximizing the independent reproducibility of our experiments. The Unofficial Apple Weblog (TUAW) is a site that satisfies these requirements. This blog site provides most of the needed information, such as blogger identification, date and time of posting, number of comments, and outlinks. The only missing piece of information at TUAW is the inlinks information, which we can obtain using the Technorati API33. We crawled the TUAW blog site and retrieved all the blog posts published since it was set up. We have collected over 10,000 posts so far34. We keep the complete history of the TUAW blog site and update it incrementally. All the statistics obtained after crawling are stored in a relational database for fast retrieval later35.

33 http://technorati.com/developers/api/cosmos.html
34 January 31, 2007.
35 This dataset will be made available upon request for research purposes.

4.3 Verification
Many blog sites publish a list of top bloggers based on their activities on the blog site. The ranking is often made according to the number of blog posts each blogger submitted over a period of time. Using the number of posts of a blogger


Table 2: Two lists of the top 5 bloggers according to TUAW and proposed model, respectively.

  Top 5 TUAW Bloggers    Top 5 Influential Bloggers
  Erica Sadun            Erica Sadun
  Scott McNulty          Dan Lurie
  Mat Lu                 David Chartier
  David Chartier         Scott McNulty
  Michael Rose           Laurie A. Duncan

Table 3: Intersection of Digg and top 20 from our model.

  Bloggers           Active    Inactive
  Influential        S1: 17    S2: 7
  Non-influential    S3: 3     S4: 0/1

Table 4: Overlap between Top 20 blog posts at Digg and our model for last 6 months for different configurations.

  Configuration          Jun 2007  May 2007  Apr 2007  Mar 2007  Feb 2007  Jan 2007
  All-in                 14        16        12        15        10        12
  No Inlinks             3         4         3         3         1         0
  No Comments            8         8         5         4         5         4
  No Outlinks            11        8         5         4         4         7
  No Blog post length    12        14        11        15        9         10

posted is obviously an oversimplified indicator, which basically says the most frequent blogger is an influential one. Such a status can be achieved by simply submitting many posts, as even junk posts are counted. Hence, an active blogger may not be an influential one; and in the same spirit, an influential blogger need not be an active one. In other words, the most active k bloggers are not necessarily the top k influential ones, and an inactive blogger can still be an influential one.

Table 2 presents two lists of top 5 bloggers according to TUAW and based on the proposed model: the first column contains the top 5 bloggers published by TUAW and the second column lists the top 5 influential bloggers. Names in italics are the bloggers present in both lists (Erica Sadun, Scott McNulty, and David Chartier). Three out of 5 TUAW top bloggers are also among the top 5 influential bloggers identified by our model. This set of bloggers suggests that some of the bloggers can be both active and influential. Some active bloggers are not influential and some influential bloggers are not active. For instance, ‘Mat Lu’ and ‘Michael Rose’ are in the TUAW list, so they are active, but they are not in the list of the influentials; ‘Dan Lurie’ and ‘Laurie A. Duncan’ are in the list of the influentials, but they are not active.

In total, there could be four types of bloggers: both active and influential, active but non-influential, influential but inactive, and inactive and non-influential.

As we know, there is no training and testing data to evaluate the efficacy of the proposed model. The absence of ground truth about influential bloggers presents another challenge. The key issue is how to find a reasonable reference point against which the four different types of bloggers can be evaluated so that we can observe their tangible differences. As an alternative to the ground truth, we resort to another Web 2.0 site, Digg36, to provide a reference point. According to Digg, “Digg is all about user powered content. Everything is submitted and voted on by the Digg community. Share, discover, , and promote stuff that’s important to you!”. As people read articles or blog posts, they can give their votes in the form of diggs, and these votes are recorded on Digg servers. This means that blog posts that appear on Digg are liked by their readers. The higher the digg score for a blog post is, the more it is liked. In a way, Digg can be considered as a large online user survey. Though only submitted blog posts are voted on, Digg offers a way to evaluate the blog posts of the four types. Digg provides an API to extract data from their database for a window of 30 days. Given the nature of Digg, a not-liked blog post will not be submitted and thus will not appear in Digg. For January 2007, there were in total 535 blog posts submitted on TUAW. As Digg only returns the top 100 voted posts, we use these 100 blog posts at Digg as our benchmark in evaluation.

36 http://www.digg.com/

We take the four categories of bloggers, viz. 1. Active and Influential, 2. Inactive and Influential, 3. Active and Non-influential, and 4. Inactive and Non-influential, and categorize their posts into S1, S2, S3, and S4, respectively. We rank the blog posts of each category based on the influence score and pick the top 20 blog posts from each of the first three categories. We randomly pick 20 blog posts from the last category, in which bloggers are neither active nor influential. Next we compare these four sets of 20 blog posts with the Digg set of 100 blog posts to see how many posts in each set also appear in the Digg set. The results are shown in Table 3. From the table, we can see that S1 has 17 out of 20 in the Digg set, and S4 has 0 or 1 found in the Digg set depending on randomization. The results show the differences among the four categories of bloggers, and that our model identifies the influentials whose blog posts are more liked than others according to Digg.

We also studied the contribution of different parameters and their relative importance. Since Digg assigns scores to blog posts and not bloggers, the top influential blog posts from Digg37 are compared with those of our model. Different configurations of the proposed model were tried by considering 1. All-in, i.e., all the four parameters, 2. No inlinks (outlinks, comments, and blog post length), 3. No comments (inlinks, outlinks, blog post length), 4. No outlinks (inlinks, comments, blog post length), and 5. No blog post length (inlinks, outlinks, comments). The overlap results for all these 5 configurations with Digg are reported in Table 4. From the results in Table 4, it can be observed that configuration 2 (no inlinks) always performs the worst, configuration 3 (no comments) performs better, then comes configuration 4 (no outlinks) and then comes configuration 5 (no blog post length). This gives the order of importance of all the four parameters, i.e., inlinks > comments > outlinks > blog post length, in decreasing order of importance to influence estimation.

37 This data was obtained using the Digg API.

For a blog site that has a reasonably long history, we can also study the temporal patterns of its influential bloggers.
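The overlap-based evaluation described in this section (rank posts by a score, keep the top k, and count how many also appear in a reference set such as the Digg list, once with all parameters and once with each parameter left out) can be sketched as follows. This is only a minimal illustration under stated assumptions: `toy_score` is a hypothetical stand-in and not the influence model of [4], the feature names and the post dictionary layout are invented for the example, and the sample data is made up.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical feature names for the four statistics named in the text
# (inlinks, comments, outlinks, blog post length); the schema is an assumption.
FEATURES = ("inlinks", "comments", "outlinks", "length")

Post = Dict[str, float]


def toy_score(post: Post, use: Iterable[str] = FEATURES) -> float:
    """Illustrative stand-in score, NOT the model of [4].

    Inlinks, comments and length add to the score and outlinks subtract,
    mirroring only the direction of each property described in the text.
    """
    enabled = set(use)
    score = 0.0
    if "inlinks" in enabled:
        score += post["inlinks"]
    if "comments" in enabled:
        score += post["comments"]
    if "outlinks" in enabled:
        score -= post["outlinks"]
    if "length" in enabled:
        score += post["length"] / 1000.0  # arbitrary scaling for the sketch
    return score


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k ids with the highest scores (ties broken by id)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [post_id for post_id, _ in ranked[:k]]


def overlap(candidates: Iterable[str], reference: Set[str]) -> int:
    """Count how many candidate posts also appear in the reference set."""
    return sum(1 for post_id in candidates if post_id in reference)


def ablation_overlaps(posts: Dict[str, Post], reference: Set[str], k: int) -> Dict[str, int]:
    """Overlap of the top-k posts with the reference set, for the all-in
    configuration and for each leave-one-parameter-out configuration."""
    configs = {"all-in": FEATURES}
    for dropped in FEATURES:
        configs["no " + dropped] = tuple(f for f in FEATURES if f != dropped)
    results = {}
    for name, use in configs.items():
        scores = {pid: toy_score(post, use) for pid, post in posts.items()}
        results[name] = overlap(top_k(scores, k), reference)
    return results


if __name__ == "__main__":
    # Made-up posts and reference set, for illustration only.
    posts = {
        "p1": {"inlinks": 5, "comments": 10, "outlinks": 1, "length": 800},
        "p2": {"inlinks": 0, "comments": 2, "outlinks": 6, "length": 300},
        "p3": {"inlinks": 3, "comments": 7, "outlinks": 2, "length": 1200},
        "p4": {"inlinks": 1, "comments": 0, "outlinks": 0, "length": 200},
    }
    reference_set = {"p1", "p3"}
    for name, hits in ablation_overlaps(posts, reference_set, k=2).items():
        print(f"{name}: {hits} of the top 2 are in the reference set")
```

With real data, `k` would be 20 and the reference set would hold the 100 posts returned by Digg; a configuration whose overlap drops sharply when a parameter is removed suggests that the parameter matters more.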

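The temporal study in this section applies the model over non-overlapping 30-day windows and then groups bloggers by how long they stay among the influentials. A rough sketch of that grouping is given below. The numeric thresholds (`long_term_min`, `transient_max`, `recent`) are assumptions chosen for illustration, since the text gives only approximate durations, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Dict, List


def window_index(day: date, start: date, days: int = 30) -> int:
    """Index of the non-overlapping window of `days` days containing `day`."""
    return (day - start).days // days


def categorize(windows: List[List[str]],
               long_term_min: int = 6,
               transient_max: int = 2,
               recent: int = 2) -> Dict[str, str]:
    """Label each blogger by how often and how recently they are influential.

    `windows[i]` holds the bloggers ranked influential in window i.
    All three thresholds are illustrative assumptions.
    """
    counts: Counter = Counter()
    first_seen: Dict[str, int] = {}
    for i, top in enumerate(windows):
        for blogger in set(top):
            counts[blogger] += 1
            first_seen.setdefault(blogger, i)

    recent_start = max(len(windows) - recent, 0)
    labels: Dict[str, str] = {}
    for blogger, n in counts.items():
        if first_seen[blogger] >= recent_start:
            labels[blogger] = "burgeoning"    # first appeared only recently
        elif n >= long_term_min:
            labels[blogger] = "long-term"
        elif n <= transient_max:
            labels[blogger] = "transient"
        else:
            labels[blogger] = "average-term"
    return labels
```

Checking the "burgeoning" condition first is a design choice of this sketch: a blogger who has only just appeared would otherwise be indistinguishable from a transient one.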
The blog site TUAW provides blogging data since its inception in February 2004. The proposed model is applied to identify the top 5 influential bloggers with a moving 30-day window until January 2007, with no overlap between two consecutive windows. Details can be found in [4]. Based on the results obtained by studying the temporal patterns of influential bloggers, we categorize these bloggers into one of the following:

Long-term influentials: They steadily maintain the status of being influential for a very long time. They can be considered an “authority” in the community.

Average-term influentials: They maintain their influence status for 4-5 months.

Transient influentials: They are influential for a very short time period (only one or two months).

Burgeoning influentials: They have recently emerged as influential bloggers. They are the influentials worthy of more follow-up examination.

Detailed results, analysis, and experiments such as weight and stability analysis, a pairwise correlation study between different statistics, a lesion study, and the inclusion of other statistics can be found in [4].

5. BLOGOSPHERE AND SOCIAL NETWORKS

Social networks encompass services like the blogosphere, social friendship networks, media sharing (pictures e.g. , videos e.g. YouTube), collaborative annotation environments (del.icio.us), and wikis. Social Friendship Networks and Blogosphere share some commonalities like social collaboration, sense of community, and experience sharing, yet there are some subtle differences. These nuances are worth pointing out here as they shed light on different ongoing research activities. Some research areas are specific to social friendship networks, like collaborative recommendation and trust and reputation, because they assume an explicit graph structure in the interaction among different members of the network. Similarly, some research areas are specific to the blogosphere, like blog post/blog site classification and spam blog identification, because of the highly textual nature of the content.

Unlike social friendship networks, the blogosphere does not have explicit links or edges between the nodes. These nodes could be friends in a social friendship network and bloggers in the blogosphere. We could still construct a graph structure in the blogosphere by assuming an edge from one blogger to another if the first blogger has commented on the other blogger’s blog post. This way we can represent the blogosphere with an equivalent directed graph. Social friendship networks already have predefined links or edges between the members in the form of a FOAF (Friends of a Friend) network, an undirected graph. This link/edge inference also poses major challenges in the majority of research efforts going on in the blogosphere, like identifying communities and influential bloggers. Although Adar et al. [1] proposed a model to infer links between different blog posts based on the propagation of the content in the blog posts, the applicability of such techniques is limited when few information epidemics are found.

Another significant difference between social friendship networks and the blogosphere lies in the way influential members are perceived. Bloggers submit blog posts, which are the main source of their influence. An influence score could be computed from blog posts through several measures like inlinks, outlinks, comments, and blog post length. This could give us an actual influential node based on the historical data of who influenced whom. Members of a social friendship network, by contrast, do not have such a medium through which they can assert their influence. The link information available on a social friendship network and other network centrality measures will just tell us the connectedness of a node, which could be used to gauge the spread of influence rather than the influential node itself. Hence, there are works that measure the spread of influence through a node in social friendship networks [13; 31; 46]. A node that maximizes the spread of influence or that has a higher degree of connectivity is chosen for . It is entirely possible that this node is connected to a lot of people but may not be the one who could influence other members. Bloggers spread their influence through blog posts. This information source could be tapped to compute the influence of a blogger using several measures like inlinks, outlinks, comments, and blog post length.

In a broader sense, influential nodes identified through the blogosphere are the ones who have “been influencing” fellow bloggers, whereas influential nodes identified through social friendship networks are the ones who “could influence” fellow members. The reason is simple: in the blogosphere we have the history of who influenced whom through their blog posts, but in social friendship networks we do not have such information. We only know who is linked to whom, and the one who is the most linked could be used to spread the influence, but we do not know whether he is the right person to do that job. In [28], the authors model the blogosphere as a social friendship network and then apply existing work on mining influence in social friendship networks, but they lose essential statistics about the blog posts like inlinks, outlinks, comments, blog post quality, etc.

A graph structure is strictly defined in social friendship networks, whereas it is loosely defined in Blogosphere. Nodes are members or actors in a social friendship network, but in Blogosphere they could be bloggers, blog posts, or blog sites. Social friendship networks are predominantly used for being in touch or making friends in society, while the main purpose of Blogosphere is to share ideas and opinions with other members of the community or other bloggers. This gives Blogosphere more of a community experience, as compared to the more friendship-oriented environment of social friendship networks. We could observe person-to-community interactions in the blogosphere, whereas one would observe more person-to-person interactions in social friendship networks. Another significant difference between the blogosphere and social friendship networks is in estimating the reputation/trust of members. A member’s reputation/trust in the blogosphere is based on the response to other members’ solicitations for advice. A member’s reputation/trust in a social friendship network is based on the network connections and/or locations in the network.

We illustrate the differences mentioned above between social friendship networks and the blogosphere in Figure 1.
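The comment-based graph construction described in this section (an edge from one blogger to another when the first has commented on the other's post) can be sketched in a few lines. The input format, a list of `(commenter, post_author)` pairs, and the function names are assumptions made for this example; self-comments are skipped here, which is also a choice of the sketch rather than something the text specifies.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_comment_graph(
        comments: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Directed graph: commenter -> authors whose posts they commented on."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for commenter, author in comments:
        # Touch both endpoints so every blogger appears as a node.
        graph[commenter]
        graph[author]
        if commenter != author:  # ignore self-comments (an assumption)
            graph[commenter].add(author)
    return dict(graph)


def in_degree(graph: Dict[str, Set[str]]) -> Dict[str, int]:
    """Number of distinct bloggers who commented on each blogger's posts."""
    degree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            degree[target] += 1
    return degree
```

Because the edges are inferred rather than declared, the resulting graph is directed and may be sparse, in contrast to the predefined undirected links of a FOAF network.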


Figure 1: Social Friendship Networks and Blogosphere constitute part of Social Networks. (Diagram labels: Social Networks; Social Friendship Networks: Orkut, Facebook, LinkedIn, Classmates.com, etc.; overlap: LiveJournal, MySpace, etc.; Blogosphere: TUAW, Blogger, etc.)

The complete set of social interacting services can be represented by Social Networks, of which Social Friendship Networks and Blogosphere are focal parts. There are certain social friendship networking websites like Orkut, Facebook, LinkedIn (http://www.linkedin.com/), and Classmates.com that strictly provide friendship networks. People join these networks to expand their social networks and keep in touch with colleagues. However, these websites enforce strict interaction patterns among friends and do not support flexible community structures as supported by blog sites. On the other hand, blog sites like TUAW, Blogger (http://www2.blogger.com/home), and Windows Live Spaces (http://spaces.live.com/) allow members to express themselves and share ideas and opinions with other community members. However, these websites do not facilitate private friendship networks. Two members have to use other communication channels (e.g. email) to communicate privately between themselves. Websites like LiveJournal and MySpace, however, provide both social friendship networks and blogging capabilities to their members. Clearly, based on the characteristics of social interactions, one could observe an overlap between social friendship networks and the blogosphere, as depicted in Figure 1.

6. CONCLUSIONS

Blogosphere is one of the fastest growing social networking media. The virtual communities in the blogosphere are not constrained by physical proximity and allow anytime, anywhere, and instant communications. In this paper we discuss current research issues in Blogosphere including modeling, blog clustering, blog mining, community discovery and factorization, influence and propagation, trust and reputation, and filtering spam blogs. We also present research methodologies including centrality measures, network models (scale-free, random, preferential attachment, hybrid, and cascade models), text and link analysis, decision theory, and agent-based modeling. We list ways to obtain datasets and to visualize and study them using available tools. We also present a case study to exemplify how major aspects discussed in the paper are integrated to develop an application for identifying influential bloggers at a community blog site. Understanding the close relation of the blogosphere with social networks, we compare and contrast these two environments and point out subtle yet significant differences that lead to different treatment for burgeoning research in these environments.

7. ACKNOWLEDGMENTS

This work is in part supported by AFOSR and ONR grants to the second author.

8. REFERENCES

[1] E. Adar, L. Zhang, L. Adamic, and R. Lukose. Implicit structure and the dynamics of blogspace. In Proceedings of the 13th International World Wide Web Conference, 2004.

[2] Nitin Agarwal, Magdiel Galan, Huan Liu, and Shankar Subramanya. Clustering blogs with collective wisdom. In Proceedings of the International Conference on Web Engineering, 2008.

[3] Nitin Agarwal, Huan Liu, John J. Salerno, and Philip S. Yu. Searching for Familiar Strangers on Blogosphere: Problems and Challenges. In NSF Symposium on Next-Generation Data Mining and Cyber-enabled Discovery and Innovation (NGDM), 2007.

[4] Nitin Agarwal, Huan Liu, Lei Tang, and Philip S. Yu. Identifying the influential bloggers. In Proceedings of the First ACM International Conference on Web Search and Data Mining (Video available at: http://videolectures.net/wsdm08 agarwal iib/), 2008.

[5] G. Attardi and M. Simi. Blog mining through opinionated words. In Proceedings of the fifteenth Text REtrieval Conference (TREC), 2006.

[6] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(509), 1999.

[7] A. Blanchard. Blogs as virtual communities: Identifying a sense of community in the julie/julia project. Into the Blogosphere: Rhetoric, Community and Culture. http://blog.lib.umn.edu/blogosphere, 2004.

[8] A. Blanchard and M. Markus. The experienced sense of a virtual community: Characteristics and processes. The DATA BASE for Advances in Information Systems, 35(1), 2004.

[9] A. Blum, T. H. C. Mugizi, and M. R. Rwebangira. A random-surfer web-graph model. In Third Workshop on Analytic Algorithmics and Combinatorics (ANALCO06), 2006.

[10] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.

[11] Christopher H. Brooks and Nancy Montanez. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 625–632, New York, NY, USA, 2006. ACM Press.

[12] Alvin Chin and Mark Chignell. A social hypertext model for finding community in blogs. In HYPERTEXT ’06: Proceedings of the seventeenth conference on Hypertext and hypermedia, pages 11–22, New York, NY, USA, 2006. ACM Press.

[13] Thayne Coffman and Sherry Marcus. Dynamic classification of groups through social network analysis and HMMs. In Proceedings of IEEE Aerospace Conference, 2004.

[14] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.

[15] Daniel Drezner and Henry Farrell. The power and politics of blogs. In American Political Science Association Annual Conference, 2004.

[16] L. Efimova and S. Hendrick. In search for a virtual settlement: An exploration of weblog community boundaries, 2005.

[17] T. Elkin. Just an online minute... online forecast. http://publications.mediapost.com/index.cfm?fuseaction=Articles.showArticle art aid=29803.

[18] Thomas L. Friedman. The World Is Flat: A Brief History of the Twenty-First Century. Farrar, Straus and Giroux, 2005.

[19] Michael Gamon, Anthony Aue, Simon Corston-Oliver, and Eric Ringger. Pulse: Mining Customer Opinions from Free Text. In Proceedings of the 6th International Symposium on Intelligent Data Analysis, 2005.

[20] Kathy E. Gill. How can we measure the influence of the blogosphere? In Proceedings of the WWW’04: workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004.

[21] Dan Gillmor. We the Media: Grassroots Journalism by the People, for the People. O’Reilly, 2006.

[22] Jennifer Golbeck and James Hendler. Inferring binary trust relationships in web-based social networks. ACM Trans. Inter. Tech., 6(4):497–529, 2006.

[23] Jacob Goldenberg, Barak Libai, and Eitan Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12:211–223, 2001.

[24] D. Gruhl, David Liben-Nowell, R. Guha, and A. Tomkins. Information diffusion through blogspace. SIGKDD Exploration Newsletter, 6(2):43–52, 2004.

[25] R. Guha, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. Propagation of trust and distrust. In WWW ’04: Proceedings of the 13th international conference on World Wide Web, pages 403–412, New York, NY, USA, 2004. ACM Press.

[26] Z. Gyongyi, P. Berkhin, Hector Garcia-Molina, and J. Pedersen. Link spam detection based on mass estimation. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006.

[27] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), 2004.

[28] Akshay Java, Pranam Kolari, Tim Finin, and Tim Oates. Modeling the spread of influence on the blogosphere. In Proceedings of the 15th International World Wide Web Conference, 2006.

[29] Anubhav Kale, Amit Karandikar, Pranam Kolari, Akshay Java, Tim Finin, and Anupam Joshi. Modeling trust and influence in the blogosphere using link polarity. In International Conference on Weblogs and Social Media, 2007.

[30] Ed Keller and Jon Berry. One American in ten tells the other nine how to vote, where to eat and, what to buy. They are The Influentials. The Free Press, 2003.

[31] David Kempe, Jon Kleinberg, and Eva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the KDD, pages 137–146, New York, NY, USA, 2003. ACM Press.

[32] J. Kleinberg. Authoritative sources in a hyperlinked environment. In 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.

[33] P. Kolari, T. Finin, and A. Joshi. SVMs for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.

[34] P. Kolari, A. Java, T. Finin, T. Oates, and A. Joshi. Detecting spam blogs: A machine learning approach. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006.

[35] Apostolos Kritikopoulos, Martha Sideri, and Iraklis Varlamis. Blogrank: ranking weblogs based on connectivity and similarity features. In AAA-IDEA ’06: Proceedings of the 2nd international workshop on Advanced architectures and algorithms for internet delivery and applications, page 8, New York, NY, USA, 2006. ACM Press.

[36] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber communities. In The 8th International World Wide Web Conference, 1999.

[37] Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. On the Bursty Evolution of Blogspace. In Proceedings of the 12th international conference on World Wide Web, pages 568–576, New York, NY, USA, 2003. ACM Press.

[38] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. In SIAM International Conference on Data Mining, 2007.

[39] Beibei Li, Shuting Xu, and Jun Zhang. Enhancing clustering blog documents by utilizing author/reader comments. In ACM-SE 45: Proceedings of the 45th annual southeast regional conference, pages 94–99, New York, NY, USA, 2007. ACM Press.

[40] Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura, and Belle L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web (AIRWeb), pages 1–8, New York, NY, USA, 2007. ACM Press.

[41] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer, 2006.

[42] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th international conference on World Wide Web (WWW), 2006.

[43] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.

[44] David M. Pennock, Gary W. Flake, Steve Lawrence, Eric J. Glover, and C. Lee Giles. Winners don’t take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences, 99(8):5207–5211, 2002.

[45] Josep M. Pujol, Ramon Sangesa, and Jordi Delgado. Extracting reputation in multi agent systems by means of social network topology. In Proceedings of the first international joint conference on Autonomous agents and multiagent systems (AAMAS), pages 467–474, New York, NY, USA, 2002. ACM Press.

[46] Matthew Richardson and Pedro Domingos. Mining knowledge-sharing sites for viral marketing. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge Discovery and Data mining, pages 61–70, New York, NY, USA, 2002. ACM Press.

[47] Jordi Sabater and Carles Sierra. Reputation and social network analysis in multi-agent systems. In AAMAS ’02: Proceedings of the first international joint conference on Autonomous agents and multiagent systems (AAMAS), pages 475–482, New York, NY, USA, 2002. ACM Press.

[48] Loren Terveen and David W. McDonald. Social matching: A framework and research agenda. ACM Trans. Comput.-Hum. Interact., 12(3):401–434, 2005.

[49] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.

[50] Bin Yu and Munindar P. Singh. Detecting deception in reputation management. In Proceedings of the second international joint conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 73–80, New York, NY, USA, 2003. ACM Press.

SIGKDD Explorations Volume 10, Issue 1 Page 31