Bipolar Person Name Identification of Topic Documents Using Principal Component Analysis

Chein Chin Chen, Department of Information Management, National Taiwan University, [email protected]
Chen-Yuan Wu, Department of Information Management, National Taiwan University, [email protected]

Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 170–178, Beijing, August 2010

Abstract

In this paper, we propose an unsupervised approach for identifying bipolar person names in a set of topic documents. We employ principal component analysis (PCA) to discover bipolar word usage patterns of person names in the documents and show that the signs of the entries in the principal eigenvector of PCA partition the person names into bipolar groups spontaneously. Empirical evaluations demonstrate the efficacy of the proposed approach in identifying bipolar person names of topics.

1 Introduction

With the advent of Web 2.0, many online collaborative tools, e.g., weblogs and discussion forums, are being developed to allow Internet users to express their perspectives on a wide variety of topics via Web documents. One benefit is that the Web has become an invaluable knowledge base for Internet users to learn about a topic comprehensively. Since the essence of Web 2.0 is knowledge sharing, collaborative tools are generally designed with few constraints so that users will be motivated to contribute their knowledge. As a result, the number of topic documents on the Internet is growing exponentially. Research subjects, such as topic threading and timeline mining (Nallapati et al., 2004; Feng and Allan, 2007; Chen and Chen, 2008), are thus being studied to help Internet users comprehend numerous topic documents efficiently.

A topic consists of a sequence of related events associated with a specific time, place, and person(s) (Nallapati et al., 2004). Topics that involve bipolar (or competitive) viewpoints are often attention-getting and attract a large number of topic documents. For such topics, identifying the polarity of the named entities, especially person names, in the topic documents would help readers learn the topic efficiently. For instance, for the 2008 American presidential election, Internet users can find numerous Web documents about the Democratic and Republican parties. Identifying important people in the competing parties would help readers form a balanced view of the campaign.

Existing works on topic content mining focus on extracting important themes in topics. In this paper, we propose an unsupervised approach that identifies bipolar person names in a set of topic documents automatically. We employ principal component analysis (PCA) (Smith, 2002) to discover bipolar word usage patterns of important person names in a set of topic documents, and show that the signs of the entries in the principal eigenvector of PCA partition the person names into bipolar groups spontaneously. In addition, we present two techniques, called off-topic block elimination and weighted correlation coefficient, to reduce the effect of data sparseness on person name bipolarization. The results of experiments based on two topic document sets, written in English and Chinese respectively, demonstrate that the proposed PCA-based approach is effective in identifying bipolar person names. Furthermore, the approach is language independent.

2 Related Work

Our research is closely related to opinion mining, which involves identifying the polarity (or sentiment) of a word in order to extract positive or negative sentences from review documents (Ganapathibhotla and Liu, 2008). Hatzivassiloglou and McKeown (1997) validated that language conjunctions, such as and, or, and but, are effective indicators for judging the polarity of conjoined adjectives. The authors observed that most conjoined adjectives (77.84%) have the same orientation, while conjunctions that use but generally connect adjectives of different orientations. They proposed a log-linear regression model that learns the distributions of conjunction indicators from a training corpus to predict the polarity of conjoined adjectives. Turney and Littman (2003) manually selected seven positive and seven negative words as a polarity lexicon and proposed using pointwise mutual information (PMI) to calculate the polarity of a word. A word has a positive orientation if it tends to co-occur with positive words; otherwise, it has a negative orientation. More recently, Esuli and Sebastiani (2006) developed a lexical resource, called SentiWordNet, which calculates the degrees of objective, positive, and negative sentiments of a synset in WordNet. The authors employed a bootstrap strategy to collect training datasets for the sentiments and trained eight sentiment classifiers to assign sentiment scores to a synset. Kanayama and Nasukawa (2006) posited that polar clauses with the same polarity tend to appear successively in contexts. The authors derived the coherent precision and coherent density of a word in a training corpus to predict the word's polarity. Ganapathibhotla and Liu (2008) investigated comparative sentences in product reviews. To identify the polarity of a comparative word (e.g., longer) with a product feature (e.g., battery life), the authors collected phrases that describe the Pros and Cons of products from Epinions.com and proposed one-side association (OSA), which is a variant of PMI. OSA assigns a positive (negative) orientation to the comparative-feature combination if the synonyms of the comparative word and feature tend to co-occur in the Pros (resp. Cons) phrases.

Our research differs from existing approaches in three respects. First, most works identify the polarity of adjectives and adverbs because the syntactic constructs generally express sentimental semantics. In contrast, our method identifies the polarity of person names. Second, to the best of our knowledge, all existing polarity identification methods require external information sources (e.g., WordNet, manually selected polarity words, or training corpora). However, our method identifies bipolar person names by simply analyzing person name usage patterns in topic documents without using external information. Finally, our method does not require any language constructs, such as conjunctions; hence, it can be applied to different languages.

3 Method

3.1 Data Preprocessing

Given a set of topic documents, we first decompose the documents into a set of non-overlapping blocks $B = \{b_1, b_2, \dots, b_n\}$. A block can be a paragraph or a document, depending on the granularity of PCA sampling. Let $U = \{u_1, u_2, \dots, u_m\}$ be a set of textual units in $B$. In this study, a unit refers to a person name. Then, the document set can be represented as an $m \times n$ unit-block association matrix $A$. A column in $A$, denoted as $b_i$, represents a decomposed block $i$. It is an $m$-dimensional vector whose $j$-th entry, denoted as $b_{i,j}$, is the frequency of $u_j$ in $b_i$. In addition, a row in $A$, denoted as $u_i$, represents a textual unit $i$; it is an $n$-dimensional vector whose $j$-th entry, denoted as $u_{i,j}$, is the frequency of $u_i$ in $b_j$.

3.2 PCA-based Person Name Bipolarization

Principal component analysis is a well-known statistical method that is used primarily to identify the most important feature pattern in a high-dimensional dataset (Smith, 2002). In our research, it identifies the most important unit pattern in the topic blocks by first constructing an $m \times m$ unit relation matrix $R$, in which the $(i,j)$-entry (denoted as $r_{i,j}$) denotes the correlation coefficient of $u_i$ and $u_j$. The correlation is computed as follows:

$$
r_{i,j} = \mathrm{corr}(u_i, u_j) = \frac{\sum_{k=1}^{n} (u_{i,k} - \tilde{u}_i)(u_{j,k} - \tilde{u}_j)}{\sqrt{\sum_{k=1}^{n} (u_{i,k} - \tilde{u}_i)^2}\,\sqrt{\sum_{k=1}^{n} (u_{j,k} - \tilde{u}_j)^2}},
$$

where $\tilde{u}_i = \frac{1}{n}\sum_{k=1}^{n} u_{i,k}$ and $\tilde{u}_j = \frac{1}{n}\sum_{k=1}^{n} u_{j,k}$ are the average frequencies of units $i$ and $j$ respectively. The range of $r_{i,j}$ is within $[-1, 1]$ and the value represents the degree of correlation between $u_i$ and $u_j$ under the decomposed blocks. If $r_{i,j} = 0$, we say that $u_i$ and $u_j$ are uncorrelated; that is, occurrences of unit $u_i$ and unit $u_j$ in the blocks are independent of each other. If $r_{i,j} > 0$, we say that units $u_i$ and $u_j$ are positively correlated. That is, $u_i$ and $u_j$ tend to co-occur in the blocks; otherwise, both tend to be jointly absent. If $r_{i,j} < 0$, we say that $u_i$ and $u_j$ are negatively correlated; that is, if one unit appears, the other tends not to appear in the same block simultaneously. Note that if $r_{i,j} \neq 0$, $|r_{i,j}|$ scales the strength of a positive or negative correlation. Moreover, since the correlation coefficient is commutative, $r_{i,j}$ will be identical to $r_{j,i}$ such that matrix $R$ will be symmetric.

A unit pattern is represented as a vector $v$ of dimension $m$ in which the $i$-th entry $v_i$ indicates the weight of the $i$-th unit in the pattern. Since matrix $R$ depicts the correlation of the units in the topic blocks, given a constituent of $v$, $v^T R v$ computes the variance of the pattern to characterize the decomposed blocks.

[…]

difference between PCA and other eigenvector-based approaches lies in the way the unit relation matrix is constructed. PCA calculates $r_{i,j}$ by using the correlation coefficient, whereas the other approaches employ the inner product or cosine formula (Manning et al., 2008) to derive the relationship between textual units. Specifically, the correlation coefficient is identical to the cosine formula if we normalize each unit with its mean:

$$
\mathrm{corr}(u_i, u_j) = \frac{\sum_{k=1}^{n} (u_{i,k} - \tilde{u}_i)(u_{j,k} - \tilde{u}_j)}{\sqrt{\sum_{k=1}^{n} (u_{i,k} - \tilde{u}_i)^2}\,\sqrt{\sum_{k=1}^{n} (u_{j,k} - \tilde{u}_j)^2}} = \frac{\sum_{k=1}^{n} u^*_{i,k}\, u^*_{j,k}}{\sqrt{\sum_{k=1}^{n} (u^*_{i,k})^2}\,\sqrt{\sum_{k=1}^{n} (u^*_{j,k})^2}} = \mathrm{cosine}(u_i^*, u_j^*),
$$

where $u_i^* = u_i - \tilde{u}_i [1, 1, \dots, 1]^T$; $u_j^* = u_j - \tilde{u}_j [1, 1, \dots, 1]^T$; and $u_i^*$ and $u_j^*$ are the mean-normalized vectors of
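
The construction described in Sections 3.1 and 3.2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `unit_relation_matrix` and `bipolarize`, the placeholder person names, and the toy frequency counts are all hypothetical, and the sketch assumes every person name has a non-constant frequency across blocks (otherwise the correlation is undefined). It builds the unit relation matrix R from a small unit-block matrix A, checks that the correlation coefficient equals the cosine of the mean-normalized rows, and splits the names by the sign of the principal eigenvector's entries, as the abstract describes.

```python
import numpy as np

# Hypothetical toy data: rows are person names (units), columns are blocks.
# The names and counts are invented purely for illustration.
names = ["A1", "A2", "B1", "B2"]
A = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 0, 1, 2, 0],
    [0, 0, 4, 2, 0, 3],
    [0, 1, 2, 3, 0, 2],
], dtype=float)


def unit_relation_matrix(A):
    """Return R, where R[i, j] is the correlation coefficient of rows i and j.

    Computed as the cosine of the mean-normalized rows, which is the
    equivalence stated in Section 3.2. Assumes no row is constant.
    """
    centered = A - A.mean(axis=1, keepdims=True)   # u_i* = u_i - mean(u_i)
    norms = np.linalg.norm(centered, axis=1)
    return (centered @ centered.T) / np.outer(norms, norms)


def bipolarize(A):
    """Return the principal eigenvector of R and a boolean sign mask."""
    R = unit_relation_matrix(A)
    eigvals, eigvecs = np.linalg.eigh(R)           # R is symmetric
    v = eigvecs[:, np.argmax(eigvals)]             # largest eigenvalue
    return v, v >= 0


R = unit_relation_matrix(A)
v, positive = bipolarize(A)

group_one = [n for n, p in zip(names, positive) if p]
group_two = [n for n, p in zip(names, positive) if not p]
print("group one:", group_one)
print("group two:", group_two)
```

The overall sign of an eigenvector is arbitrary, so which group ends up "positive" carries no meaning; only the split itself does. `np.linalg.eigh` is used because R is symmetric, and `np.corrcoef(A)` offers an independent check of the hand-written correlation.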
