Discovering Important Bloggers based on Analyzing Threads

Shinsuke Nakajima Junichi Tatemura Yoichiro Hino National Institute of NEC Laboratories America,Inc Dept. of Social Informatics, Information and 10080 North Wolfe Road, Kyoto University Communications Technology Suite SW3-350, Cupertino, CA Yoshida Honmachi, Sakyo-ku 3-5 Hikari-dai Seika-cho 95014, USA Kyoto, Japan Soraku-gun Kyoto, Japan [email protected] [email protected] [email protected] labs.com u.ac.jp

Yoshinori Hara Katsumi Tanaka Internet Systems Research Dept. of Social Informatics, Laboratories, NEC Kyoto University Corporation Yoshida Honmachi, Sakyo-ku 8916-47, Takayama-cho, Kyoto, Japan Ikoma, Nara, Japan [email protected] [email protected] u.ac.jp

ABSTRACT 1. INTRODUCTION A blog (weblog) lets people promptly publish content (such The broadband infrastructure and ubiquitous computing as comments) relating to other through . have created an environment in which people are continu- This type of web content can be considered as a conversa- ally online on the (WWW). Given this tion rather than a collection of archived document. To cap- environment, people are increasingly publishing their reac- ture ’hot’ conversation topics from blogs and deliver them tions (e.g., comments and opinions) to current events (e.g., to users in a timely manner, we propose a method of dis- news). Users may state their opinion of a current news ar- covering bloggers who take an important role in conversa- ticle, followed by other users who react to their opinions tions. We characterize bloggers based on their roles in pre- by stating a different opinion. In this sense, the web can vious blog threads (a set of blog entries comprises a con- be seen as a place for conversation rather than for archived versation),. We provide several definitions of bloggers’ roles documents. Triggered by an event, a hot conversation may including (1) agitators who stimulate discussion, and (2) quickly propagate from one site to another through the web. summarizers who provide summaries of the discussion. We A weblog, or blog for short, is a tool or web site that consider that these bloggers are likely to be useful in iden- enables people to publish content promptly. The “Glossary tifying hot conversations. In this paper, we discuss models of Internet Terms [1]” says that of blogs and blog thread data, and methods of extracting blog threads, discovering important bloggers, and acquiring “A blog is basically a journal that is available on the web. important content from their entries. The activity of updating a blog is “blogging” and someone who keeps a blog is a “.” Blogs are typically updated Categories and Subject Descriptors daily using software that allows people with little or no tech- nical background to update and maintain the blog. Postings H.4.3 [Information Systems]: Information Systems Ap- on a blog are almost always arranged in chronological order plicationsCommunications Applications; H.2.8 [Information with the most recent additions featured most prominently.” Systems]: ManagementDatabase Applications; J.4 [Computer Applications]: Social and behavioral sci- A blog entry, a primitive entity of blog content, typically ences has links to web pages or other blog entries, creating a con- versational web through multiple blog sites. General Terms Since conventional search engines treat the web as a snap- shot of hyperlinked documents, they are not very effective Human Factors for capturing conversational web content such as blogs. A new approach is required for timely delivery of hot conver- Keywords sations over multiple blogs on the web. blog thread, important blogger, agitator To capture potentially hot conversations on the web, we Copyright is held by the author/owner(s). propose a method for discovering bloggers who take impor- WWW2005, May 10–14, 2005, Chiba, Japan. tant roles in these conversations. This information on blog- . RSS URL micro behavior of blogspace. In particular, they focused on not only the explicit link structure but also the implicit Blog site Blog site routes of transmission for finding blogs that are sources of PermaLink information. Blog entry A news site However, their purpose was not to acquire important web Blog entry Blog entry content. There are numerous reports of studies on topic detection 䋺 PermaLink A Web site 䋺 and tracking[5]. In particular, there have been several stud- Blog entry ies on first story detection (FSD). The goal of FSD is to Blog site recognize when a new topic appears that has not been pub- Reply Link PermaLink Blog entry lished previously. In this paper, we adopt a similar technique Blog entry to discriminate agitators (aspect 3 (5.2)). Allan et al. stud- Blog entry ied first story detection[6]. According to this study, when a 䋺 䋺 䋺 䋺 new story arrives, its feature set is compared to those of all 䋺 䋺 past stories. If it is sufficiently different, the story is marked as a first story; otherwise, it is not. Figure 1: Typical blog site. Though these previous studies of FSD are relevant to us, we cannot adopt these methods directly since blog threads include relations between entries based on replylinks, which ger characteristics can then be used to acquire important are different from a simple news stream. hot conversations Figure 1 shows a schematic of a typical blog site. A blog 3. BLOGS AND BLOG THREAD DATAMODEL site is usually created by a single owner/blogger and con- sists of his or her blog entries, each of which usually has There are several definitions of blogs other than the one a permalink (URL: uniform resource locator) to enable di- given in Section 1. Some are shown below. rect access to the entry. Blog readers can discover bloggers’ • From ”Web log:” A blog is basically a journal that characteristics (e.g., their interests, role in the community, is available on the web. The activity of updating a etc.) by browsing their past blog entries. If readers know blog is ”blogging” and someone who keeps a blog is a the characteristics of a particular blog, they can expect sim- ”blogger.” [7] ilar characteristics to appear in future entries in that blog. • Our goal is to develop a method of capturing hot conver- An easily updated personal , generally updated sations by automating readers’ processes for characterizing daily and expressing opinions. [8] and monitoring blogs. • Web pages that are constantly updated with new com- In our method, an important blogger is defined on the ba- mentary and links relating to a particular topic. Often sis of his or her role in a blog thread, i.e., a set of blog entries very personal. [9] comprising a conversation on a specific topic. We think it is likely that bloggers take various roles in a thread, including That is, a blog is a website that anybody can easily update acting as (1) an agitator who stimulates discussion, or (2) a and use to express his/her own opinions in a public space. summarizer who provides summaries of the discussion. We To put it another way, blogs are a storehouse of information believe that these three types are important bloggers who that reflects public opinion. Although there is a lot of trivial are useful for identifying hot conversations. information in blogspace, there is also a lot of important We first describe related work and describe blogs and a information. model of blog thread data, and then the extraction of blog Before defining our model of blog thread data , let us threads. This is followed by a discussion of how important discuss these definitions of blog sites and blog entries. Ex- bloggers are discovered, and examples of applications of our amples are shown in Figures 2 and 3. method. We end with a summary and outline our plans for + + future work. site =(siteURL, RSS, blogger , siteName, entry ) entry =(permaLink, blogger, time, title?, description, comment∗) 2. RELATED WORK comment =(blogger, permaLink, content, time) In related work on analyzing blogspace, Kumar et al. studied the burstiness of blogspace[2]. They examined 25,000 A blog site has a site URL, RSS (really simple syndica- blog sites and 750,000 links to the sites. They focused on tion), site name, and entries, and is managed by one or clusters of blogs connected via hyperlinks named blogspaces more bloggers. A blog entry has a permalink for access, a and investigated the extraction of blog communities and the publication time, title, and entry description. A comment evolution of the communities. includes the content of the comment and the time when it Gruhl et al. studied the diffusion of information through was written. A blogger posts a blog to an entry identified blogspace[3]. They examined 11,000 blog sites and 400,000 by a permalink. links in the sites, and tried to characterize macro topic- diffusion patterns in blogspaces and micro topic-diffusion replyLink =(ei,ej ), (ei → ej ) patterns between blog entries. They also tried to model trackbackLink =(ei,ej ), (ei → ej ) topic diffusion by means of criteria called Chatter and Spikes. sourceLink =(ei,wi), (ei → wi) Adar et al. studied the implicit structure and dynamics where ei,ej ∈ E , E is a set of blog entries, and wi ∈ W , of blogspace[4]. They also examined both the macro and W is a set of Web pages except blog entries. rootEntry replyLink SiteName blogger replyLink sourceLink Blog site Blog entry Blog site Blog site RSS SiteURL Blog entry Blog entry Blog site Web site Blog entry Blog site Blog site Blog site Blog entry Blog entry Blog entry

rootEntry Blog entry 䋺 Figure 4: Example of blog thread 䋺 䋺 䋺 rected connected graph and is defined as follows.

Figure 2: Example of blog site thread := (V, L) V = W ∪ E, L = Ls ∪ Lr W is a set of . E is a set of blog entries.   Site Ls ⊆{(e, e )|e ∈ E, e ∈ W }   blogger Lr ⊆{(e, e )|e ∈ E, e ∈ E} title Ls corresponds to a set of sourceLink. PermaLink Lr corresponds to a set of replyLink. time description Ideally, the entries in a blog thread should share common topics. However, topics sometimes change at a particular Blog entry comment point. In the future, we will separate a blog thread at that point even if both parts are connected via replyLinks.

sourceLink replyLink 4. EXTRACTION OF BLOG THREADS This section shows how blog threads are extracted from real blog data. Web site Blog entry (except blog entry) 4.1 Crawling through blog entries First, our system crawls through blog entries to extract Figure 3: Example of blog entry blog threads. 1. The system adds unregistered RSS feeds to the RSS list by crawling through files, replyLinks and sourceLinks are hyperlinks described in ”http://ping.bloggers.jp/opml.xml”, the description of a blog entry to other blog entries or web ”http://www.scripting.com/feeds/top100.opml” pages. They do not include automatically added hyperlinks, , and so on. such as links to previous entries, the next entry in the blog site, or other links unrelated to the content of the descrip- 2. The system crawls through RSS feeds registered on tion of the blog entry. This is because we want to remove the RSS list and registers the title, permalink, and list link noise to ensure that all blogspot.com pages point to entry date as ungained entry data in the RSS. blogger.com, etc. The method for removing link noise is de- scribed in Section 4.2. Here, a trackbackLink is a special 3. Return to 1. case of a replyLink. For trackbackLink =(ei,ej ), there is not only a replyLink The RSS corresponds to the RDF Site Summary, which is of (ei → ej ) but also a link of (ej → ei) to indicate the actually an extension of RDF (resource description frame- existence of a replyLink. work) language. RSS, which is an extensible de- An example of a blog thread is shown in Fig.4. We define a scription and syndication format, is an XML application blog thread as follows. A blog thread is composed of entries that conforms to the W3C’s RDF specification. These days, connected via replyLinks to a discussion among bloggers. most blog sites syndicate their content to subscribers by There is one exception. As Fig. 4 indicates, sets of entries means of an RSS. that are not connected to each other via replyLinks are re- OPML (outline processor markup language) is an XML garded as being the same thread if they refer to the same format for outlines. Originally developed as a native file website via a sourceLink. Comments attached a blog entry format for an outliner application it has since been adopted are not used in order to extract blog thread, because we want for other uses, the most common being to exchange lists of to identify important bloggers by analyzing blog thread so RSS feeds between RSS aggregators. that it is not very important of comment author whose blog Our system had registered about 400,000 RSS feeds (=blog site could not be identified. Namely, a blog thread is a di- sites) and over 10,000,000 entries as of January 15, 2005. 4.2 Extracting hyperlinks from descriptions 㪋㪇 of blog entries We need to extract the hyperlinks from descriptions of 㪊㪇 blog entries to discover possible connections between the en- try and other web pages (including blog entries). Therefore, we have to be able to recognize the scope of the description of an entry, based on an analysis of the HTML . How- 㪉㪇 ever, each blog site server has its own tag structure so we need to set up rules for analyzing the tag structure of each blog site server that we want to analyze. Our target blog 㪈㪇 sites are limited to famous blog-hosting sites and some fa-

mous bloggers’ sites because naturally we are unable to set A in a thread blog entries of number up rules for every blog site. We therefore set up rules for 㪇 㪉㪇㪇㪊㩷㪛㪼㪺㪅㩷㪉㪌 㪉㪇㪇㪋㩷㪡㪸㫅㪅㩷㪐 㪉㪇㪇㪋㩷㪡㪸㫅㪅㩷㪉㪋 㪉㪇㪇㪋㩷㪝㪼㪹㪅㩷㪏 㪉㪇㪇㪋㩷㪝㪼㪹㩷㪉㪊 analyzing the tag structure of about 25 famous hosting sites and some famous bloggers’ sites. This enabled us to remove Figure 5: Example of time-series data of entries in link noise from the replyLinks and sourceLinks. a blog thread. The procedure for extracting hyperlinks from blog entries is given below. rootEntry 1. The system crawls through the permalinks of entries that have not been crawled through in the entry list, and obtains the entries. 2. Descriptions of the entries are extracted from the HTML text by analyzing the tag structure. 3. Hyperlinks are extracted from the description and added to the list of links. 4.3 Extraction of blog threads A blog thread is a set of entries connected to each other via replylinks and referring to a common web page via a sourcelink, as stated in section 3 (see Fig. 4). The procedure for extracting blog threads is given below. 1. The system judges whether each in the link list is a replyLink or sourcelink by checking whether Figure 6: Example of link structure of a blog thread. the destination URL of the hyperlink appears in the entry list. 5.1 Important bloggers in blog threads 2. If it is a replyLink, the departure and destination We define several potentially important bloggers in a con- of the replyLink are checked to see whether or not they versation as follows: are registered in the existing thread data. If they are, they are added to the existing thread data. They be- 1. Agitator come elements of a new thread if they do not. An Agitator often stimulates the discussion in a blog 3. If it is a sourceLink, the departure URL of the sourceLink thread so that it becomes more active. Thus, we may is checked to see whether or not it is registered in the be able to predict whether a blog thread will grow by existing thread data, and the destination URL of the watching an Agitator’s an entry. If the system sta- sourceLink is checked to see whether it is consistent tistically judges that threads often grow just after a with the Web page URL referred to by an entry reg- particular blogger has published an entry, then that istered in an existing thread. If there is an existing blogger is judged to be an Agitator. thread to be entered, it is added to the existing thread 2. Summarizer data. If not, it becomes an element of a new thread. A Summarizer often publishes an entry that summa- 4. Return to 1. rizes a hot blog thread by referring to many other en- The extracted thread data represents sets of entries, each tries. Thus, we may be able to obtain a summary of a with a date and link data. Consequently, the system can blog thread that include hot topic by watching the en- analyze the time-series data fir the entries in a thread (see tries of Summarizers. If the system statistically judges Fig. 5) and the link structure of a blog thread (see Fig. 6). that a particular blogger publishes entries that often In Fig. 6, each circle corresponds to a blog entry. each refer to other entries, then that blogger is judged to arrow corresponds to hyperlink between entries. be a Summarizer. In this paper, we focus on how to define and find an agita- 5. DISCOVERING IMPORTANTBLOGGERS tor and a summarizer, because we are interested in discover- This section explains how to identify important bloggers. ing important, hot blog conversations on a specific topic and we believe that we can discover important blog conversation mx is the number of entries in threadi that were pub- and its summary about a specific topic as blog threads by lished in the t (days) before ex. watching agitators and summarizers on specific topics. 5.2 Discriminants for Agitator Therefore, (lx/mx) corresponds to an approximation of a second derivative value. In this section, we introduce three aspects that charac- terize a blogger as an agitator. Given a blog thread, the • Aspect 3: topic-based discriminant following aspects discriminate an entry ex from the other Entries by an agitator often have different character- entries in a blog thread. We can then characterize the blog- istics in terms of content from entries in threadi that gers who publish such entries by aggregating the values from were published before ex. In addition, they often have multiple blog threads. similar characteristics in terms of content to entries of • Aspect 1: link-based discriminant threadi that were published after ex. Namely, an agi- tator may have a big impact on a blog thread and may An entry by an agitator is characterized by the number cause it to change. of links to an entry from other entries. That is, an agitator is a blogger who is popular with other bloggers ex, an entry by an agitator, is identified based on the

on a topic. following discriminant.

´ 

ex, an entry by an agitator is identified based on the 

¡

following discriminant. 1 · Èx+n

Similarity n x+1 ei ,ex

µ

 

¡ (kx) >θ1, − 1 · Èx−n Similarity ex, n x−1 ei >θ3,

where kx is the number of entries in threadi that have a replylink to ex. where ex−n is a feature vector of the nth latest entry in threadi before ex was published and ex+n is a feature vector of the nth earliest entry in • Aspect 2: popularity-based discriminant threadi after ex was published. An entry by an agitator is characterized by a dramatic increase in the popularity of a thread shown by the In this paper, feature vectors are calculated by means number of entries published just after the agitators’s of the TF(term frequency) values of the description of entry. a blog entry. Similarity between feature vectors de- notes text similarity calculated based on the cosine- correlation. Tangential line Therefore, we believe that another discriminant for an agitator is represented by the point at which the topic of a blog thread changes. Agitator:1 We proposed three aspects for discriminating agitators. Though a single aspect is not enough to judge if an entry is by an agitator or not, using the three discriminants together should enable us to indentify agitators. (a number of blog (a number entries) Popularity of a blog thread Popularity of

Time 5.3 Discriminant for Summarizer In this section, we introduce discriminant for Summarizer. Figure 7: Feature of agitator in time-series data for Given a blog thread, the following discriminant identifies an popularity of blog thread. entry ex from the other entries in a blog thread. Conse- quently, we characterizes bloggers who published such en- As shown in Fig. 7, the time-series data for the pop- tries by aggregating values from multiple blog threads. ularity of entries in a blog thread seem to reflect pe- • link-based discriminant riods of stagnation and increased activity. Therefore, we consider that an agitator’s entry appears between An entry of summarizer is characterized by the number the end of a stagnant period and the beginning of a of links from an entry to other entries.

period of an increased activity. ex, an entry of summarizer, is identified based on the ex, an entry by an agitator is identified using the fol- following discriminant. lowing discriminant.

(px) >θ4, (lx/mx) >θ2,

where px is the number of entries in threadi that have where lx is the number of entries in threadi that were a replylink from ex. published in the t (days) after ex and 5.4 Examination of Discriminants for Discov- 17 ering Agitators and Summarizers 29 19 We discuss the discriminants of aspects of agitator (5.2) 30 and discriminant of summarizer (5.3) by means of examina- 20 31 tion of real blog data. Examples of real blog thread data are 7 21 shown below. Fig. 8, 9 and 10 are data of thread (A). Fig. 8 14 22 32 11, 12 and 13 are data of thread (B). 2 9 24 33 12 15 34 25

14 10 6 12 5 15 11 26 18 3 13 5 16 27 7 16 2 1 10 17 6 8 23 3 18 28 9 4

1 11 13 4 Sep. 26 29Oct. 1 2 3 4

Figure 11: Example of link graph of the blog thread Sep. 20Sep. 25 Sep. 30 Oct. 5 Oct. 10 Oct. 15 (B). Figure 8: Example of link graph of a blog thread approximate second derivative value derivative approximate second 㪋㪇 㪋㪇 (A). A number of blog entries in a thread

approximate second derivative value 䋨 l x /m

approximate second derivative value derivative approximate second 㪊㪇

㪊㪇 x

㪉㪇 㪈㪇 2 Aspect 4.2 see , 䋨 A number of blog entries in a thread l x /m 㪉㪇 㪉㪇 㪈㪌 x e . set2 Aspect 4.2 see ,

㪈㪇 㪈㪇 㪈㪇 㪌 䋩 approximate second derivative value 㪇 㪇 㪌 A number of blog entries in a thread 㪪㪼㫇㪅㪊㪇 㪦㪺㫋㪅㪈 㪦㪺㫋㪅㪉 㪦㪺㫋㪅㪊 㪦㪺㫋㪅㪋 㪦㪺㫋㪅㪌 䋩 㪇 Figure 12: Time-series data of popularity and sec-

A number of blog entries in a thread 㪇 㪪㪼㫇㪅㩷㪉㪇 㪪㪼㫇㪅㩷㪉㪌 㪪㪼㫇㪅㩷㪊㪇 㪦㪺㫋㪅㩷㪌 㪦㪺㫋㪅㩷㪈㪇 㪦㪺㫋㪅㩷㪈㪌 ond derivative value of thread (B).

Figure 9: Time-series data of popularity and second derivative value of thread (A). horizontal axis indicates publishing date of entries. Each date is in 2004. The vertical axis has no meaning. There are some replyLinks from older entry to newer entry 㪉㪇 㪇㪅㪇㪈 A number of blog entries in a thread in Fig. 8 and Fig. 11. Generally, bloggers can edit their Degree of topic change entries after publishing them. However, the date of entry Degree of topic change Degree of topic 䋨 are not changed. That is the reason why such replylinks 㪈㪌 3 see 4.2 Aspect exist. We allow such hyperlinks from newer entry to older entry as replylink. 㪈㪇 㪇 As shown in Fig. 8, the entry of No. 5 of thread (A) seems something important in the thread, because it is cited from 㪌 䋩 9 other entries and entries in the thread increase just after published the entry of No. 5. Thus, the entry might be an 㪇 㪄㪇㪅㪇㪈 entry of agitator based on aspect 1 of agitator. However, A number of blog entries in a thread 㪪㪼㫇㪅㩷㪉㪇 㪪㪼㫇㪅㩷㪉㪌 㪪㪼㫇㪅㩷㪊㪇 㪦㪺㫋㪅㩷㪌 㪦㪺㫋㪅㩷㪈㪇 㪦㪺㫋㪅㩷㪈㪌 it is not always that many in-links mean good meanings. For example, some one may try to find fault with a blogger Figure 10: Time-series data of popularity and degree with referring via replyLink. Moreover, it is possible that an of topic-change of thread (A). entry stimulate some one to write another entry nevertheless he/she does not refer to the entry. Thus, link analysis is not In Fig. 8 and Fig. 11, each circle corresponds to a blog enough to judge agitator or not. entry. Each arrow corresponds to a replyLink between en- Next let us discuss discriminant of agitator and summa- tries. The number in the circles denote the number from rizer with time-series data of a blog thread shown in Fig. 9 the oldest date entry to the newest date entry. The thread The solid line in Fig. 9 corresponds to the time series data (A) has 18 entries, and the thread (B) has 34 entries. The of popularity of the thread (A). The circle in black corre- rsial,a h on fN.5a el fcus,teeis there course, Of well. ( since as increasing it, 5 is about No. popularity doubt of the no point before the just at term drastically, the entries. in blog high of are popularity of data series time value derivative the second a of of approximation an to corresponds ayohretisi h hed hs hyfil satisfy fairly they Thus, (5.3). thread. summarizer the of citing discriminant are in they entries because other important, many something seem also (B) aspect and condition. 2, following aspect the 1, agitators. aspect of for 3 discriminants the of part ee fsmo hs he r eadda agitator. as upper regarded the are normalized three is these First, of discriminant sum each data. of of blog level part real using left discriminants the three the uating is site blog whose about blogger trend famous IT is and of blogger 2 the topic 1, Acctually, aspect agitator. satisfies of entry 3 the since increasing agitator is of candidate the popularity at the high before become just change when topic According entry drastically. of high. 5 degree is the degree No. 10, the possibility if fair Fig. a thread to is of It cor- topic change. This changing for topic discriminant of 5.2. of Section degree the in the of explained to as part responds agitator, left and an in of the popularity line 3 to dashed aspect of The corresponds data (A). 10 thread time-series the Fig. the of shows topic-change of 10 degree Fig. value.5.2. objective means such by but is hot judgement it become human However, thread not the popularity. of when thread detect of to derivative important second a of and to corresponds line dashed The 5. No. ( of entry the to sponds (B). thread degree of and topic-change popularity of of data Time-series 13: Figure l

x A number of blog entries in a thread ssoni i.9 ecnseta h auso ( of values the that see can we 9, Fig. in shown As ssoni i.1,teetiso o 4ad2 fthread of 24 and 14 No. of entries the (B). 11, thread Fig. about in shown discuss As us let another, For S eval- by agitators identify to attempted we system, our In ”http://blog.japan.cnet.com/umeda/”. be to seems (A) thread of entry 5 No. of blogger Hence, section the in agitator of 3 aspect discuss will we Next, α S /m 㪈㪇 㪉㪇 㪊㪇 㪋㪇 㪈㪇 㪉㪇 㪊㪇 㪋㪇 A A 㪇 㪇 + Agi = x orsod otesoefra agitator. an for score the to corresponds β ,wihi xlie nScin52 aey ( Namely, 5.2. Section in explained is which ), A number of blog blog of number A α 3 + 㪼㪅㪇㪦㫋㪈㪦㫋㪉㪦㫋㪊㪦㫋㪋㪦㪺㫋㪅㪌 㪦㪺㫋㪅㪋 㪦㪺㫋㪅㪊 㪦㪺㫋㪅㪉 㪦㪺㫋㪅㪈 㪪㪼㫇㪅㪊㪇 㪼㪅㪇㪦㫋㪈㪦㫋㪉㪦㫋㪊㪦㫋㪋㪦㪺㫋㪅㪌 㪦㪺㫋㪅㪋 㪦㪺㫋㪅㪊 㪦㪺㫋㪅㪉 㪦㪺㫋㪅㪈 㪪㪼㫇㪅㪊㪇 · orsodt h vrg ftenraie left normalized the of average the to correspond γ Agi 0 (0 100 = α 1 , + β entries in entries a thread β and · Degree of topic change Agi ≤ γ S l 2 A x r egtn atr,adsatisfy and factors, weighting are /m + ≤ γ x 100) · savleo approximation of value a is ) Agi 3 Agi 1 l l x x ,

/m /m

㪄㪇㪅㪇㪈 㪇㪅㪇㪈 Agi 㪇

䋩 䋨

x x 3 Aspect 4.2 see

2 ) ) Degree of topic change topic of Degree .EAPEO APPLICATION OF EXAMPLE 6. meic hnaue rwe escontents. news browses with user information system a the supplementary when is important immediacy the which insert propose 14, can existed therefore Fig. that We have in shown not occurred. application may event following news information a the the on before because information the item engine, supplementary news search when pre-crawl Even a cannot as on. such so news system technique and related conventional blogger, thread, a important blog using an a by of to referred part entries bloggers, include important could opinions. by provided biased information with the them example, presenting For avoid is to it information audiences that supplementary feel to of We variety a topic. provide news to cover a important to cannot relating information they the immediacy, all with information important news browsing when immediacy content. with information mentary neae oteetycnet n oon. so and content, miswritten entry like the noises links to advertisement of , unrelated lot non-existing a to hyperlinks has files, data blog since data, blog threads dis- blog of multiple system on the analysis. based develop blogger should impor- important we of discover covering that to necessary so order is in bloggers, based it analysis tant not, Therefore, thread or blog thread. blogger statistical a one to important only an impossible analyzing is is on blogger it a summarizer, that and judge agitator of candidates their in better entries a other is browsing entry sites. by blog No.14 No.24 of than blogger actually. contents summarizer that these believe browsing I by Moreover, predict racing horse of the important and another topic-change. candidates find the summarizer of to degree hard of has thread is entries that it between term But relation in active. changed has more thread figure, been the the From of change. topic topic dashed of the The degree (B). a thread to the corresponds of line topic-change the of degree and of part hot hot, thread. very is blog thread summarize a may before published summarizer have great it it course, because if Of agitator active. be become No. cannot thread of the before entry not the published to corresponds gray of in entry circle the to 24. the of corresponds 14, popularity white in of No. circle data The series entries. approximate time blog an of to value corresponds derivative the line of second popularity dashed The of The (B). data (B). series thread thread time to of corresponds value line derivative solid second and popularity data. time-series in summarizer, of candidates 24, h prto fteapiainsse sotie below. outlined is system application the of operation The deliver may programs TV and sites news famous Though supple- important inserting of way a shows section This nadin ehv ocnie o ormv ossof noises remove to how consider to have we addtion, In discovering of examination the about discuss we Though topic a of summary described are entries these that found I popularity of data time-series the shows 13 Fig. Further, have 24 nad 14 No. of entries agitator, of candidate Unlike of data time-series the shows 12 Fig. 9, Fig. and as 14 same No. As of entries of behavior the examine us let Next, .Tesse dnie motn lgeso particular on bloggers important identifies system The 1. Insert important supplementary information • Develop the system of identifying important blogger by using important bloggers with immediacy. based on multiple blog threads analysis. • Design and develop a prototype system for providing important supplementary information with immediacy when browsing news content.

8. ACKNOWLEDGMENTS This research is partly supported by the research for a Grant of Scientific Research (16016247 and 14208036) from the Ministry of Education, Culture, Sports, Science and Technology of Japan, and the Informatics Research Center News site News TV program for Development of Knowledge Society Infrastructure(COE program of the Ministry of Education, Culture, Sports, Sci- ence and Technology, Japan). Figure 14: Example of applications inserting supple- mentary information with news contents. 9. REFERENCES [1] Matisse’s Glossary of Internet Terms topics. http://www.matisse.net/files/glossary.html [2] R. Kumar, J. Novak, P. Raghavan, A. Tomkins: ”On 2. When news contents is delivered, the system estimates the Bursty Evolution of Blogspace”, The Twelfth the topic of the news content. International World Wide Web Conference (2003). 3. The system crawls the blog data from important blog- http://www2003.org/cdrom/papers/refereed/p477/p477- gers related to the news content. kumar/p477-kumar.htm [3] D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins: 4. The crawled blog data are categorized using a cluster- ”Information Diffusion Through Blogspace”, The ing method. Thirteenth International World Wide Web Conference (2004). 5. The system provides blog data from important blog- http://www2004.org/proceedings/docs/1p491.pdf gers that differs to some degree from the news content, [4] E. Adar, L. Zhang: ”Implicit Structure and Dynamics because data that is the same as news content is not of Blogspace”, WWW2004 Workshop on the useful as supplementary information. Weblogging Ecosystem: Aggregation, Analysis and In future work, we plan to work on identifying important Dynamics (2004). bloggers for various topics, detecting topics in news content, [5] J. Allan: ”Topic Detection and Tracking”, Kluwer crawling blog data from important bloggers, clustering blog Academic publishers (2002). data and presenting information that supplements news con- [6] J. Allan, V. Lavrenko, H. Jin: ”First Story Detection tent. In TDT Is Hard”, In Ninth International Conference on Information Knowledge Management (CIKM’2000) 7. CONCLUSIONS (2000). [7] MITSOL Help - Glossary we proposed a method for identifying important bloggers http://www.mitsol.co.za/help glossary.htm and acquiring important content from their blog entries. [8] eGlossary The results of this study can be summarized as follows: http://www.internettime.com/itimegroup/eglossary.htm 1. A blog data model, blog site model, blog entry model, [9] Glossary - Access eGovernment and blog thread model were defined. http://www.access-egov.info/glossary.cfm?xid=PA 2. We described a method of extracting blog threads, and extracted threads from real blog data ( more than 10,000,000 entries ) registered in our system. 3. We described a method for identifing agitators and summarizers as important bloggers by establishing dis- criminants for them. 4. We proposed a way of providing important supplemen- tary information with immediacy when browsing news content.

In addition, in future work we plan to:

• Investigate a suitable method for detecting the topic of blog contents to identify the area of expertise of important bloggers.