Linked Document Embedding for Classification

Suhang Wang†, Jiliang Tang‡, Charu Aggarwal#, and Huan Liu†
†Computer Science & Engineering, Arizona State University, Tempe, AZ, USA
‡Computer Science & Engineering, Michigan State University, East Lansing, MI, USA
#IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
†{suhang.wang,huan.liu}@asu.edu, ‡[email protected], #[email protected]

ABSTRACT

Word and document embedding algorithms such as Skip-gram and Paragraph Vector have been proven to help various text analysis tasks such as document classification, document clustering and information retrieval. The vast majority of these algorithms are designed to work with independent and identically distributed documents. However, in many real-world applications, documents are inherently linked. For example, web documents such as blogs and online news often have hyperlinks to other web documents, and scientific articles usually cite other articles. Linked documents present new challenges to traditional document embedding algorithms. In addition, most existing document embedding algorithms are unsupervised and their learned representations may not be optimal for classification when labeling information is available. In this paper, we study the problem of linked document embedding for classification and propose a linked document embedding framework LDE, which combines link and label information with content information to learn document representations for classification. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework. Further experiments are conducted to understand the importance of link and label information in the proposed framework LDE.

Keywords

Document Embedding, Linked Data, Word Embedding

CIKM'16, October 24-28, 2016, Indianapolis, IN, USA. DOI: http://dx.doi.org/10.1145/2983323.2983755

1. INTRODUCTION

A meaningful and discriminative representation for documents can help many text analysis tasks such as document classification, document clustering and information retrieval. Many document representation methods have been proposed, such as bag-of-words, N-gram, latent semantic analysis [12], latent Dirichlet allocation [5] and word/document embedding [18, 17, 13]. Among these algorithms, the recently proposed distributed representations of words and documents such as Skip-gram [18, 17] and PV-DM [13] have demonstrated superior performance in many tasks such as word analogy [18], parsing [9], POS tagging [9], and sentiment analysis [11]. The assumption behind these document/word embedding approaches is basically the distributional hypothesis that "you shall know a word by the company it keeps [10]." They embed words or documents into a low-dimensional space, which can alleviate the curse of dimensionality and data sparsity problems suffered by traditional representations such as bag-of-words and N-gram.

The vast majority of existing document embedding algorithms work with "flat" data, and documents are usually assumed to be independent and identically distributed (the i.i.d. assumption). However, in many real-world scenarios, documents are inherently linked. For example, web documents such as blogs and online news often contain hyperlinks to other web documents, and scientific articles commonly cite other articles. A toy example of linked documents is illustrated in Figure 1, where {d1, d2, . . . , d5} are documents and {w1, w2, . . . , w8} are words in the documents. In addition to content information, documents are linked, and links suggest the inter-dependence of documents. Hence, the i.i.d. assumption of documents does not hold [33]. The additional link information of such documents has been shown to be useful in various tasks such as document classification [33, 8], clustering [14, 29] and feature selection [27]. Therefore, we propose to study the novel problem of linked document embedding following the distributional hypothesis.

Most existing document embedding algorithms use unsupervised learning, such as those in [18, 13, 32]. The representations learned by these algorithms are very general and can be applied to various tasks. However, they may not be optimal for some specialized tasks where label information is available, such as y2 for d2 and y5 for d5 in Figure 1(a). For example, deep learning algorithms such as convolutional neural networks [11], which use label information, often outperform text embeddings for classification tasks [23]. Hence, in this paper we study the novel problem of linked document embedding for classification and investigate two specific problems: (1) how to capture link and label information mathematically; and (2) how to exploit them for document embedding. In an attempt to address these two problems, we propose a novel linked document embedding (LDE) framework for classification. The major contributions of the paper are summarized next:

• We provide a principled way to capture link and label information mathematically;

• We propose a novel framework LDE, which learns word and document embeddings for classification by combining link and label information with content information; and

• We conduct experiments on real-world datasets to understand the effectiveness of the proposed framework LDE.

Figure 1: A Toy Example of Linked Documents. (a) Linked Documents; (b) Three Types of Relations. {d1, d2, . . . , d5} are documents; {w1, w2, . . . , w8} are words; y2 is the label of d2 and y5 is the label of d5.

The rest of the paper is organized as follows. In Section 2, we briefly review the related work. In Section 3, we formally define the problem of linked document embedding for classification. In Section 4, we introduce the proposed framework LDE with the details about how to model link and label information and how to incorporate them in document and word embedding. In Section 5, we show how to solve the optimization problem of LDE along with solutions to accelerate the learning process. In Section 6, we present empirical evaluation with discussion. In Section 7, we conclude with future work.

2. RELATED WORK

In this paper, we investigate linked document representation for classification, which is mainly related to document representation, linked data and graph-based classification.

2.1 Document Representation

Document representation is an important research area that has received great attention lately and can benefit many machine learning and data mining tasks such as document classification [23], information retrieval [31, 20] and sentiment analysis [13]. Many different types of models have been proposed for document representation. Bag-of-words [21] is one of the most widely used ones. It is simple to implement, but not scalable since, as the number of documents increases, the vocabulary size can become huge. At the same time, it suffers from data sparsity and the curse of dimensionality, and the semantic relatedness between different words is omitted. To mitigate the high dimensionality and data sparsity problems of BOW, Latent Semantic Analysis [12] uses a dimensionality reduction technique, i.e., SVD, to project the document-word matrix into a low-dimensional space. Latent Dirichlet Allocation [5] is another low-dimensional document representation algorithm. It is a generative model that assumes that each document has a topic distribution and each word in the document is drawn from a topic with certain probability. Recently, Mikolov et al. proposed the distributed representations of words, Skip-gram and CBOW, which learn the embeddings of words by utilizing word co-occurrence in the local context [18, 17]. They have been proven to be powerful in capturing the semantic and syntactic meanings of words and can benefit many natural language processing tasks such as word analogy [18], parsing [9], POS tagging [9], and sentiment analysis [11]. They are also scalable and can handle millions of documents. Based on the same distributed representation idea, [13] extended the word embedding model to document embedding (PV-DM, PV-DBOW) by finding document representations that are good at predicting words in the document. Document embedding has also been proven to be powerful in many tasks such as sentiment analysis [13], machine translation [28] and information retrieval [20]. Recently, the predictive text embedding algorithm (PTE) was proposed in [23], which also utilizes label information to learn predictive text embeddings. The proposed framework LDE is inherently different from PTE: (1) LDE is developed for linked documents while PTE still assumes documents to be i.i.d.; (2) LDE captures label information via modeling document-label relations while PTE uses label information via word-label relations; (3) in addition to label information, LDE also models link information among documents to learn document embeddings; and (4) the proposed formulations and optimization problems of LDE are also different from those of PTE.

2.2 Linked Document Representation

Documents in many real-world applications are inherently linked. For example, web pages are linked by hyperlinks and scientific papers are linked by citations. Link information has been proven to be very effective for machine learning and data mining tasks such as feature selection [26, 27], recommender systems [15, 16], and document classification/clustering [19, 3]. Based on the idea that two linked documents are likely to share similar topics, several works have been proposed to utilize link information for better document representations [7, 35, 32]. For example, RTM [7] extends LDA by considering link information for topic modeling; PMTLM [35] combines topic modeling with a variant of the mixed-membership block model to model linked documents; and TADW [32] learns linked document representations based on matrix factorization. However, the majority of the aforementioned works do not utilize label information; meanwhile, most of them do not learn distributed document representations based on the distributional hypothesis, while LDE employs the distributional hypothesis idea for document embedding by combining link and label information with content simultaneously.

2.3 Graph-based Classification

Graph-based classification utilizes link information to design classifiers. Various graph-based classification algorithms have been proposed. Label propagation [34] is a classical graph-based method, which performs classification by propagating label information from labeled data to unlabeled data through the graph. However, label propagation does not utilize the features of documents. GC [4] is a more advanced graph-based classification method, which takes into account both link structure and documents' content and can be combined with SVM classifiers. Graffiti [2] is proposed to perform random walks on heterogeneous networks so as to capture the mutual influence of connected nodes for classification. Abernethy et al. [1] incorporate the graph information into an SVM classifier for web spam detection. The proposed method LDE is inherently different from the existing graph-based classification methods. First, LDE learns both word embeddings and document embeddings, which can be used for other tasks such as word analogy [18] and visualization [23], while existing graph-based classification methods do not learn word/document representations. Second, LDE utilizes the distributional hypothesis idea and considers word-word-document relations, while existing graph-based classification methods usually use BOW without considering the word-word relationship.

3. PROBLEM STATEMENT

We first introduce the notations used in this paper. Throughout the paper, matrices are written as boldface capital letters and vectors are denoted as boldface lowercase letters. For an arbitrary matrix M ∈ R^{m×n}, M_ij is the (i, j)-th entry of M while m^i and m_j are the i-th row and j-th column of M, respectively. ||M||_F is the Frobenius norm of M. Capital letters in calligraphic math font such as P are used to denote sets, and |P| is the cardinality of the set P.

Let D = {d_1, d_2, . . . , d_N} be a set of N documents and W = {w_1, w_2, . . . , w_M} be the word dictionary of size M for D. Documents in D are linked, which forms a document network G = (V, E), where each vertex is a document and e_ij = 1 if documents d_i and d_j are connected. We use Y to denote the subset of labeled documents in D, where y_i represents the label information of the document d_i. Let D ∈ R^{d×N} be the document embedding matrix, where the i-th column of D, i.e., d_i ∈ R^{d×1}, is a d-dimensional vector representation of the document d_i. Similarly, we use W ∈ R^{d×M} to denote the word embedding matrix, where the j-th column of W, i.e., w_j ∈ R^{d×1}, is a d-dimensional vector representation of the word w_j in W.

With the aforementioned definitions and notations, the problem under study is formally stated as: given the document set D, the document network G and partial label information of D, i.e., Y, we want to learn the document embedding matrix D and the word embedding matrix W. Mathematically, the problem is written as:

f(D, G, Y) → {D, W}    (1)

where f is the learning algorithm we propose to investigate.
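To make the notation concrete, the sketch below (added for illustration; it is not part of the original paper) shows one way to hold the quantities defined above — the embedding matrices D ∈ R^{d×N}, W ∈ R^{d×M}, the document network G and the partial labels Y — as NumPy arrays. The uniform [−0.2, 0.2] initialization follows the choice reported later in Section 5.4; all sizes are hypothetical.

```python
import numpy as np

# Hypothetical sizes for illustration only.
N, M, Nc, d = 1000, 5000, 6, 100   # documents, vocabulary, classes, embedding dimension

rng = np.random.default_rng(0)

# Embedding matrices: columns are embeddings (D is d x N, W is d x M, Y_emb is d x Nc),
# initialized uniformly in [-0.2, 0.2] as described in Section 5.4.
D_emb = rng.uniform(-0.2, 0.2, size=(d, N))
W_emb = rng.uniform(-0.2, 0.2, size=(d, M))
Y_emb = rng.uniform(-0.2, 0.2, size=(d, Nc))

# Document network G: a set of edges (i, j) with e_ij = 1.
edges = {(0, 1), (0, 2), (3, 4)}      # toy links between documents

# Partial label information: only a subset of documents carries a class id.
labels = {0: 2, 3: 5}                  # document index -> class index
```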

4. THE PROPOSED FRAMEWORK

To model content, link and label information for word and document embedding, we extract three types of relations by examining Figure 1(a). The three types of relations are demonstrated in Figure 1(b): (1) word-word-document relations from content information, shown in Figure 1(b1); (2) document-document relations from link information, shown in Figure 1(b2); and (3) document-label relations from label information, shown in Figure 1(b3). Next we elaborate these three relations and their corresponding model components before introducing the proposed framework LDE.

4.1 Modeling Word-Word-Document Relations

The distributional hypothesis that "you shall know a word by the company it keeps" suggests that a word has close relationships with its neighboring words. For example, the phrases win the game and win the lottery appear very frequently; thus the pair of words win and game and the pair of words win and lottery could have a very close relationship. When we are only given the word win, we would highly expect the neighboring words to be words like game or lottery instead of words like light or air. This suggests that a good word representation should be useful for predicting its neighboring words, which is the essential idea of Skip-gram [18]. Meanwhile, depending on the topics of the documents, the probabilities of words appearing in the documents are different [5]. For example, though the appearance of the phrase win the lottery is frequent, if we know that the topic of a document is about "sports", we would expect words such as game or competition after the word win instead of the word lottery, because win the game/competition is more reasonable under the topic of "sports". On the contrary, if the topic of the document is about "lottery", then we would expect lottery after win. These intuitions suggest that the predictions of neighboring words for a given word also strongly rely on the document. Therefore, we extract word-word-document relations from content information.

For a word w_i, we use a window of size c to extract w_i and its (c − 1) neighbors with w_i at the center, and then w_i and each of its c − 1 neighbors w_j form a pair (w_i, w_j). At the same time, we record which document the pair of words (w_i, w_j) comes from, say d_k. The pair of words (w_i, w_j) and the document d_k form a triplet (w_i, w_j, d_k). An illustrative example of this process is given in Figure 1(b1), where the window size c is 2. We denote all these triplets as a set P. Note that in P, there may be multiple copies of (w_i, w_j, d_k) if the co-occurrence of w_i and w_j happens multiple times in d_k, and there may also be both (w_i, w_j, d_s) and (w_i, w_j, d_k) if the co-occurrence of w_i and w_j appears in both d_s and d_k. After extracting P, the word-word-document relations can be captured by maximizing the average log probability:

$$\max_{\mathbf{W},\mathbf{D}} \; \frac{1}{|\mathcal{P}|} \sum_{(w_i, w_j, d_k) \in \mathcal{P}} \log P(w_j \mid w_i, d_k) \qquad (2)$$

where P(w_j | w_i, d_k) is the probability that, given d_k, the word w_j is a neighboring word of w_i, which is defined as

$$P(w_j \mid w_i, d_k) = \frac{\exp(\mathbf{w}_j^T \mathbf{w}_i + \mathbf{w}_j^T \mathbf{d}_k)}{\sum_{t=1}^{M} \exp(\mathbf{w}_t^T \mathbf{w}_i + \mathbf{w}_t^T \mathbf{d}_k)} \qquad (3)$$

4.2 Modeling Document-Document Relations

Links between documents indicate the inter-dependence of documents. For example, a piece of online news about "sports" is likely to have hyperlinks to other news on "sports", and a web mining article is likely to cite other web mining articles. Two linked documents are likely to share similar topics, which is a property commonly exploited in many tasks such as classification [22] and feature selection [26]. Therefore, we extract document-document relations from link information. For two linked documents d_i and d_j, i.e., e_ij = 1, the embedding vector of d_i is a good indicator of that of d_j since they are likely to share similar topics, which can be achieved by the following optimization problem:

$$\max_{\mathbf{D}} \; \frac{1}{|\mathcal{E}|} \sum_{i=1}^{N} \sum_{j: e_{ij}=1} \log P(d_j \mid d_i) \qquad (4)$$

where |E| is the number of links and P(d_j | d_i) is given as

$$P(d_j \mid d_i) = \frac{\exp(\mathbf{d}_j^T \mathbf{d}_i)}{\sum_{k=1}^{N} \exp(\mathbf{d}_k^T \mathbf{d}_i)} \qquad (5)$$

From Eq.(5), we can see that if two linked documents have similar representations, then P(d_j | d_i) will be large. Thus, Eq.(4) aims at maximizing the similarity between two linked documents based on their embedding vectors.
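The following sketch (our illustration, not the authors' implementation) shows how the triplet set P and the exact probabilities of Eqs. (2)–(5) can be computed with NumPy, reusing the matrix layout from the sketch in Section 3. Note that the paper's window size c counts w_i together with its (c − 1) neighbors; the helper below simply takes a fixed number of neighbors on each side.

```python
import numpy as np

def extract_triplets(docs, half_window=2):
    """Collect (w_i, w_j, d_k) triplets from a symmetric context window around each word."""
    triplets = []
    for d_k, words in enumerate(docs):          # docs: list of documents, each a list of word ids
        for pos, w_i in enumerate(words):
            lo, hi = max(0, pos - half_window), min(len(words), pos + half_window + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    triplets.append((w_i, words[ctx], d_k))
    return triplets

def p_word_given_word_doc(W_emb, D_emb, w_i, w_j, d_k):
    """Eq. (3): softmax over the vocabulary of w_t^T w_i + w_t^T d_k."""
    scores = W_emb.T @ (W_emb[:, w_i] + D_emb[:, d_k])     # length-M score vector
    scores -= scores.max()                                  # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_j]

def p_doc_given_doc(D_emb, d_i, d_j):
    """Eq. (5): softmax over all documents of d_k^T d_i."""
    scores = D_emb.T @ D_emb[:, d_i]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[d_j]
```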
4.3 Modeling Document-Label Relations

For the classification problem, we have some labeled samples, and label information can guide the document embedding algorithm to learn better embeddings. Let Y ∈ R^{d×N_c} be the label embedding matrix, where N_c is the number of unique labels and the k-th column of Y, y_k, is the embedding vector of the k-th label. y_i is the label of the i-th document, and the corresponding label embedding vector for y_i is y_{y_i}. However, to avoid notation confusion, we use y_{d_i} instead of y_{y_i} to denote the representation of the class label that is assigned to d_i in the remainder of the paper. A good document embedding vector for d_i should be a good indicator of the label of d_i. In other words, given the document, we should be able to predict its label; hence we extract document-label relations from label information, which can be captured as follows:

$$\max_{\mathbf{Y},\mathbf{D}} \; \frac{1}{|\mathcal{Y}|} \sum_{i: y_i \in \mathcal{Y}} \log P(y_{d_i} \mid d_i) \qquad (6)$$

where P(y_{d_i} | d_i) is the probability that d_i's label is y_{d_i}, which is given as

$$P(y_{d_i} \mid d_i) = \frac{\exp(\mathbf{y}_{d_i}^T \mathbf{d}_i)}{\sum_{k=1}^{N_c} \exp(\mathbf{y}_k^T \mathbf{d}_i)} \qquad (7)$$

4.4 Linked Document Embedding

With the model components to capture content, link and label information, the proposed linked document embedding framework LDE solves the following optimization problem:

$$\begin{aligned} \min_{\mathbf{W},\mathbf{D},\mathbf{Y}} \; & -\frac{1}{|\mathcal{P}|} \sum_{(w_i,w_j,d_k) \in \mathcal{P}} \log P(w_j \mid w_i, d_k) \\ & -\frac{1}{|\mathcal{E}|} \sum_{i=1}^{N} \sum_{j: e_{ij}=1} \log P(d_j \mid d_i) \\ & -\frac{1}{|\mathcal{Y}|} \sum_{i: y_i \in \mathcal{Y}} \log P(y_{d_i} \mid d_i) + \gamma \, \Omega(\mathbf{W}, \mathbf{D}, \mathbf{Y}) \end{aligned} \qquad (8)$$

In Eq.(8), the first term aims to learn document and word embeddings that are useful for predicting neighboring words, which captures content information. The second term models link information, and the third term captures label information, which endows the document embeddings with the capability to predict labels. Ω(W, D, Y) is the regularizer to prevent the model from overfitting. We could choose an ℓ1-norm regularizer if the dimension of the embedding were high. Since we want to represent documents and words with low-dimensional vectors for classification, we use the Frobenius norm in our model:

$$\Omega(\mathbf{W}, \mathbf{D}, \mathbf{Y}) = \|\mathbf{W}\|_F^2 + \|\mathbf{D}\|_F^2 + \|\mathbf{Y}\|_F^2 \qquad (9)$$
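For small toy inputs, the three terms of Eq. (8) can be assembled directly from the exact softmax probabilities, which is useful for sanity checks. The sketch below is our own illustration and reuses `p_word_given_word_doc` and `p_doc_given_doc` from the previous sketch; it costs O(M) (or O(N)) per instance, which is exactly what the negative-sampling approximation of Section 5 avoids.

```python
import numpy as np

def p_label_given_doc(Y_emb, D_emb, label, d_i):
    """Eq. (7): softmax over the N_c class embeddings of y_k^T d_i."""
    scores = Y_emb.T @ D_emb[:, d_i]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[label]

def lde_objective(W_emb, D_emb, Y_emb, triplets, edges, labels, gamma=1e-4):
    """Eq. (8): the three negative average log-likelihoods plus the regularizer of Eq. (9)."""
    content = -np.mean([np.log(p_word_given_word_doc(W_emb, D_emb, wi, wj, dk))
                        for wi, wj, dk in triplets])
    link = -np.mean([np.log(p_doc_given_doc(D_emb, di, dj)) for di, dj in edges])
    label = -np.mean([np.log(p_label_given_doc(Y_emb, D_emb, y, di))
                      for di, y in labels.items()])
    reg = gamma * (np.sum(W_emb ** 2) + np.sum(D_emb ** 2) + np.sum(Y_emb ** 2))
    return content + link + label + reg
```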
5. LEARNING LDE

In this section, we introduce the details of how to use a stochastic gradient method to train the model. We first introduce how to speed up the training process and then give the update rules and the detailed algorithm.

5.1 Approximation by Negative Sampling

To update w_j, we need to take the derivative of log P(w_j | w_i, d_k) w.r.t. w_j, which is given by:

$$\nabla_{\mathbf{w}_j} \log P(w_j \mid w_i, d_k) = \big(1 - P(w_j \mid w_i, d_k)\big)(\mathbf{w}_i + \mathbf{d}_k) \qquad (10)$$

From Eq.(10), we find that updating w_j requires the calculation of P(w_j | w_i, d_k). However, this calculation is expensive, because the denominator of P(w_j | w_i, d_k) is the sum Σ_{t=1}^{M} exp(w_t^T w_i + w_t^T d_k). It requires a summation over all the words, which can be very inefficient since the number of words is usually very large. To accelerate training, following the method used in the Skip-gram model, we use the trick of negative sampling. In detail, negative sampling is defined by the following objective function [18]:

$$\log \sigma\big(\mathbf{w}_j^T(\mathbf{w}_i + \mathbf{d}_k)\big) + \sum_{t=1}^{K} \mathbb{E}_{w_t \sim P_n(w)}\big[\log \sigma\big(-\mathbf{w}_t^T(\mathbf{w}_i + \mathbf{d}_k)\big)\big] \qquad (11)$$

which replaces every log P(w_j | w_i, d_k) term in the objective function of Eq.(8). Thus the task becomes to distinguish the target word w_j from K words drawn from the noise distribution P_n(w). The idea behind negative sampling is that we want to maximize the similarity between w_j and (w_i + d_k) and minimize the similarity between a randomly sampled word w_t and (w_i + d_k). In this way, we can approximately maximize log P(w_j | w_i, d_k). In practice, the noise distribution is chosen to be U(w)^{3/4}/Z, where U(w) is the unigram distribution of the words and Z = Σ_w U(w)^{3/4} is the normalization term.

Thus, for a training instance (w_i, w_j, d_k) ∈ P, we draw K negative word samples, say one negative sample is w_t, from the noise distribution as w_t ∼ P_n(w), and then we put (w_i, w_t, d_k) into the negative training set N. It is easy to verify that |N| = K|P|. Now with N and P, we can approximate the first term in Eq.(8) using Eq.(11) as

$$\min_{\mathbf{W},\mathbf{D}} \; -\frac{1}{|\mathcal{P}|} \sum_{(w_i,w_j,d_k) \in \mathcal{P}} \log \sigma\big(\mathbf{w}_j^T(\mathbf{w}_i + \mathbf{d}_k)\big) - \frac{1}{|\mathcal{P}|} \sum_{(w_i,w_t,d_k) \in \mathcal{N}} \log \sigma\big(-\mathbf{w}_t^T(\mathbf{w}_i + \mathbf{d}_k)\big) \qquad (12)$$

Similarly, we approximate P(d_j | d_i) as

$$\log \sigma(\mathbf{d}_j^T \mathbf{d}_i) + \sum_{t=1}^{K} \mathbb{E}_{d_t \sim \tilde{P}_n(d)}\big[\log \sigma(-\mathbf{d}_t^T \mathbf{d}_i)\big] \qquad (13)$$

where d_t in Eq.(13) is randomly sampled from documents that are not linked with d_i. Thus, for each linked document pair (d_i, d_j), we need to randomly sample K documents d_t that are not linked to d_i and put (d_i, d_t) into the negative document set N_E. We can see that |N_E| = K|E|. Now the second term in Eq.(8) can be written as

$$\min_{\mathbf{D}} \; -\frac{1}{|\mathcal{E}|} \Big[ \sum_{e_{ij} \in \mathcal{E}} \log \sigma(\mathbf{d}_j^T \mathbf{d}_i) + \sum_{(d_i,d_t) \in \mathcal{N}_E} \log \sigma(-\mathbf{d}_t^T \mathbf{d}_i) \Big] \qquad (14)$$

With a similar idea, P(y_{d_i} | d_i) can be approximated as

$$\log \sigma(\mathbf{y}_{d_i}^T \mathbf{d}_i) + \sum_{y \neq y_{d_i}} \log \sigma(-\mathbf{y}_y^T \mathbf{d}_i) \qquad (15)$$

With these negative sampling approximations, the objective function of Eq.(8) can be approximated as

$$\begin{aligned} \min_{\mathbf{W},\mathbf{D},\mathbf{Y}} \; & -\frac{1}{|\mathcal{P}|} \sum_{(w_i,w_j,d_k) \in \mathcal{P}} \log \sigma\big(\mathbf{w}_j^T(\mathbf{w}_i + \mathbf{d}_k)\big) \\ & -\frac{1}{|\mathcal{P}|} \sum_{(w_i,w_t,d_k) \in \mathcal{N}} \log \sigma\big(-\mathbf{w}_t^T(\mathbf{w}_i + \mathbf{d}_k)\big) \\ & -\frac{1}{|\mathcal{E}|} \sum_{e_{ij} \in \mathcal{E}} \log \sigma(\mathbf{d}_j^T \mathbf{d}_i) - \frac{1}{|\mathcal{E}|} \sum_{(d_i,d_t) \in \mathcal{N}_E} \log \sigma(-\mathbf{d}_t^T \mathbf{d}_i) \\ & -\frac{1}{|\mathcal{Y}|} \sum_{i: y_i \in \mathcal{Y}} \Big[\log \sigma(\mathbf{y}_{d_i}^T \mathbf{d}_i) + \sum_{y \neq y_{d_i}} \log \sigma(-\mathbf{y}_y^T \mathbf{d}_i)\Big] + \gamma \, \Omega(\mathbf{W}, \mathbf{D}, \mathbf{Y}) \end{aligned} \qquad (16)$$
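A minimal sketch (ours, with assumed inputs) of the negative-sampling machinery: the noise distribution P_n(w) ∝ U(w)^{3/4} and the sampled form of Eq. (11) for one positive triplet. `word_counts` is assumed to be an array of raw unigram counts; K = 7 matches the setting used later in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_noise_distribution(word_counts):
    """P_n(w) proportional to U(w)^(3/4), normalized over the vocabulary."""
    weights = np.asarray(word_counts, dtype=float) ** 0.75
    return weights / weights.sum()

def neg_sampling_term(W_emb, D_emb, w_i, w_j, d_k, noise_probs, K=7, rng=None):
    """Eq. (11): log sigma(w_j^T(w_i+d_k)) plus K sampled log sigma(-w_t^T(w_i+d_k)) terms."""
    rng = rng if rng is not None else np.random.default_rng()
    h = W_emb[:, w_i] + D_emb[:, d_k]
    value = np.log(sigmoid(W_emb[:, w_j] @ h))
    for w_t in rng.choice(len(noise_probs), size=K, p=noise_probs):
        value += np.log(sigmoid(-W_emb[:, w_t] @ h))
    return value
```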

5.2 Updating Rules

We use the stochastic gradient descent method to train the proposed model. Thus, for each training sample, we need to update the involved word, document or label representations. For a given training instance (w_i, w_j, d_k) ∈ P, Eq.(16) reduces to

$$f_1 = -\frac{1}{|\mathcal{P}|} \log \sigma\big(\mathbf{w}_j^T(\mathbf{w}_i + \mathbf{d}_k)\big) + \gamma\big(\|\mathbf{w}_i\|_2^2 + \|\mathbf{w}_j\|_2^2 + \|\mathbf{d}_k\|_2^2\big)$$

The derivatives of the above equation w.r.t. w_i, w_j and d_k are given as

$$\begin{aligned} \nabla_{\mathbf{w}_i} f_1 &= \frac{1}{|\mathcal{P}|}\big[\sigma\big(\mathbf{w}_j^T(\mathbf{w}_i + \mathbf{d}_k)\big) - 1\big]\mathbf{w}_j + 2\gamma \mathbf{w}_i \\ \nabla_{\mathbf{d}_k} f_1 &= \frac{1}{|\mathcal{P}|}\big[\sigma\big(\mathbf{w}_j^T(\mathbf{w}_i + \mathbf{d}_k)\big) - 1\big]\mathbf{w}_j + 2\gamma \mathbf{d}_k \\ \nabla_{\mathbf{w}_j} f_1 &= \frac{1}{|\mathcal{P}|}\big[\sigma\big(\mathbf{w}_j^T(\mathbf{w}_i + \mathbf{d}_k)\big) - 1\big](\mathbf{w}_i + \mathbf{d}_k) + 2\gamma \mathbf{w}_j \end{aligned} \qquad (17)$$

Then w_i, w_j and d_k are updated as

$$\mathbf{w}_i \leftarrow \mathbf{w}_i - \eta \nabla_{\mathbf{w}_i} f_1, \quad \mathbf{d}_k \leftarrow \mathbf{d}_k - \eta \nabla_{\mathbf{d}_k} f_1, \quad \mathbf{w}_j \leftarrow \mathbf{w}_j - \eta \nabla_{\mathbf{w}_j} f_1 \qquad (18)$$

where η is the learning rate.

Similarly, when the training instance is (w_i, w_t, d_k) ∈ N, Eq.(16) reduces to:

$$f_2 = -\frac{1}{|\mathcal{N}|} \log \sigma\big(-\mathbf{w}_t^T(\mathbf{w}_i + \mathbf{d}_k)\big) + \gamma\big(\|\mathbf{w}_i\|_2^2 + \|\mathbf{w}_t\|_2^2 + \|\mathbf{d}_k\|_2^2\big)$$

Then w_i, w_t and d_k are updated as

$$\mathbf{w}_i \leftarrow \mathbf{w}_i - \eta \nabla_{\mathbf{w}_i} f_2, \quad \mathbf{d}_k \leftarrow \mathbf{d}_k - \eta \nabla_{\mathbf{d}_k} f_2, \quad \mathbf{w}_t \leftarrow \mathbf{w}_t - \eta \nabla_{\mathbf{w}_t} f_2 \qquad (19)$$

When a training instance is from E, say (d_i, d_j), we update d_i and d_j by the gradient descent method. When a training instance is from N_E, say (d_i, d_t), we update d_i and d_t. Similarly, when the training instance is (d_i, y_i), we update d_i and Y, since all the label representations in Y are involved, as shown in the fourth line of Eq.(16). We omit the detailed derivations here since they are very similar to the aforementioned ones.

5.3 Subsampling of Frequent Words

There is another issue we need to deal with. In large corpora, the most frequent words, such as "in" and "a", can easily occur millions of times. In each epoch, these words will correspondingly be trained millions of times, and the vector representations of these words will not change significantly after several epochs [18]. On the contrary, some rare words are trained less frequently in each epoch and thus need more epochs to train. To account for the imbalance between rare and frequent words, we use the sub-sampling approach as in [18]: each word w_i in the training set is discarded with a probability computed by the formula:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} \qquad (20)$$

where f(w_i) is the frequency of the word w_i and t is a chosen threshold, typically around 10^{-5}. The advantage of this sub-sampling formula is that it aggressively sub-samples words whose frequencies are greater than t while preserving the ranking of the frequencies.
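The update rules of Section 5.2 and the subsampling rule of Eq. (20) translate almost line by line into code. The sketch below is our illustration for a positive triplet (w_i, w_j, d_k) ∈ P; the 1/|P| factor of Eq. (17) is folded into the learning rate η, a common simplification, and the learning rate value is a placeholder since the paper does not report η.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step_positive(W_emb, D_emb, w_i, w_j, d_k, eta=0.025, gamma=1e-4):
    """One stochastic step on f_1 (Eqs. (17)-(18)); updates the involved columns in place."""
    h = W_emb[:, w_i] + D_emb[:, d_k]
    err = sigmoid(W_emb[:, w_j] @ h) - 1.0          # sigma(w_j^T(w_i + d_k)) - 1
    grad_wi = err * W_emb[:, w_j] + 2 * gamma * W_emb[:, w_i]
    grad_dk = err * W_emb[:, w_j] + 2 * gamma * D_emb[:, d_k]
    grad_wj = err * h + 2 * gamma * W_emb[:, w_j]
    W_emb[:, w_i] -= eta * grad_wi
    D_emb[:, d_k] -= eta * grad_dk
    W_emb[:, w_j] -= eta * grad_wj

def discard_probability(word_freq, t=1e-5):
    """Eq. (20): probability of discarding an occurrence of a word with frequency f(w_i)."""
    return max(0.0, 1.0 - np.sqrt(t / word_freq))
```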

5.4 A Learning Algorithm for LDE

With the negative sampling and the update rules, the algorithm to learn LDE is summarized in Algorithm 1. We first prepare the training instances from line 1 to line 7. In line 8, we initialize the parameters W, D and Y. Following common practice, we initialize each element of W, D and Y by randomly sampling from the uniform distribution [-0.2, 0.2]. We then train LDE and update W, D and Y given the training data using the gradient descent method from line 9 to line 13. Finally, the document embedding D and the word embedding W are obtained.

Algorithm 1 LDE - Linked Document Embedding
Input: D, G = {V, E}, Y, γ, window size c, dimension d
Output: D, W
1: Construct P by using a sliding window of size c to extract instances (w_i, w_j, d_k) from the documents, where (w_i, w_j, d_k) is added to P with probability sqrt(t / f(w_i))
2: for each training sample in P do
3:    Draw K negative samples from the noise distribution and put them into N
4: end for
5: for e_ij in E do
6:    Randomly sample K documents that are not linked with d_i and put them into N_E
7: end for
8: Initialize W, D and Y
9: repeat
10:   for each training instance do
11:      Update the involved parameters using SGD as described in Section 5.2
12:   end for
13: until Convergence
14: Return D, W

D is the document embedding, which we name LDE-Doc. We can also represent documents using word embeddings. In particular, to get the document representation from the word embeddings W, for a document d_i we average all the words in the document as

$$\tilde{\mathbf{d}}_i = \frac{1}{N_i} \sum_{w_i \in d_i} \mathbf{w}_i \qquad (21)$$

where N_i is the length of the document d_i and the vector d̃_i is used as the document representation for d_i. We denote the document representation by word embeddings as LDE-Word.

5.5 Time Complexity

When the training instance is (w_i, w_j, d_k) ∈ P, from Eq.(17) and Eq.(18) we can see that the costs of calculating the derivative of f_1 w.r.t. w_i and of updating w_i are both O(d). With a similar analysis, we find that the computational costs of calculating gradients and updating parameters are also O(d) when training instances are from N, P, E, N_E or Y. Thus, we only need to count the size of the training data, which is (K + 1)|P| + (K + 1)|E| + N_c|Y|. Therefore, the total computational cost in one epoch is ((K + 1)|P| + (K + 1)|E| + N_c|Y|) · O(d). Considering the fact that E is usually very sparse, the complexity is comparable to that of Skip-gram, which is scalable to millions of documents [18].
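Putting the pieces together, Algorithm 1 can be mirrored by the skeleton below. This is our sketch, not the released implementation: `sgd_step_negative`, `link_step` and `label_step` are hypothetical stand-ins for the remaining update rules of Section 5.2, while `extract_triplets` and `sgd_step_positive` are the helpers from the earlier sketches. `lde_word_representation` is Eq. (21), the word-embedding average used for LDE-Word.

```python
import numpy as np

def lde_word_representation(W_emb, doc_word_ids):
    """Eq. (21): average the embeddings of the words in a document (LDE-Word)."""
    return W_emb[:, doc_word_ids].mean(axis=1)

def train_lde(docs, edges, labels, noise_probs, d=100, K=7, half_window=3,
              epochs=10, eta=0.025, gamma=1e-4, seed=0):
    """Skeleton of Algorithm 1; noise_probs must have one entry per vocabulary word."""
    rng = np.random.default_rng(seed)
    M = 1 + max(w for doc in docs for w in doc)      # vocabulary size
    N, Nc = len(docs), 1 + max(labels.values())      # number of documents / classes
    W_emb = rng.uniform(-0.2, 0.2, size=(d, M))      # line 8: initialization in [-0.2, 0.2]
    D_emb = rng.uniform(-0.2, 0.2, size=(d, N))
    Y_emb = rng.uniform(-0.2, 0.2, size=(d, Nc))

    triplets = extract_triplets(docs, half_window)   # line 1 (frequent-word subsampling omitted)
    for _ in range(epochs):                          # lines 9-13
        for w_i, w_j, d_k in triplets:               # content relations: P and its negatives N
            sgd_step_positive(W_emb, D_emb, w_i, w_j, d_k, eta, gamma)
            for w_t in rng.choice(M, size=K, p=noise_probs):
                sgd_step_negative(W_emb, D_emb, w_i, w_t, d_k, eta, gamma)   # hypothetical helper
        for d_i, d_j in edges:                       # link relations: E and N_E
            link_step(D_emb, d_i, d_j, rng, K, eta, gamma)                   # hypothetical helper
        for d_i, y in labels.items():                # label relations: Y
            label_step(D_emb, Y_emb, d_i, y, eta, gamma)                     # hypothetical helper
    return D_emb, W_emb
```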
6. EXPERIMENTS

In this section, we conduct experiments to evaluate the effectiveness of the proposed framework LDE. Specifically, we aim to answer the following questions:

• How effective is the proposed framework in learning document representations compared to the state-of-the-art methods?

• How does label information affect the performance of the proposed framework? and

• Does the network information provide additional information for learning better document representations?

We begin by introducing the datasets and experimental settings, and then we compare LDE with the state-of-the-art algorithms for classification to answer the first question. We also investigate the sensitivity of LDE w.r.t. label and link information to answer the second and third questions.

6.1 Datasets and Experimental Settings

6.1.1 Datasets

The experiments are conducted on two real-world linked document datasets, DBLP and BlogCatalog. The DBLP dataset is extracted by Arnetminer [24] from the DBLP website. Each document in the DBLP dataset contains the title, authors and year of a paper. Some documents also contain venues, abstracts and reference papers. We use titles and abstracts as the document contents. Thus, we remove documents whose abstracts and titles are missing. We then choose six categories from the corpus¹, including "Computer networks", "Database:Data mining:Information retrieval", "Computer graphics:Multimedia", "Software engineering", "Theoretical computer science" and "High-Performance Computing". After that, we randomly select 2550 samples from each chosen category and add links between two documents if one document cites another document. BlogCatalog² is a blog directory where users can register their blogs under predefined categories. The categories are used as class labels of blogs. Each blog has a text description added by the owner, which is used as document content in our work. The homepage of each blog lists several blogs related to this blog, which forms links between a blog and its related blogs. In addition, if the owner of blog A follows the owner of blog B, we also add a link from blog A to blog B. We remove categories whose number of blogs is less than 500, which leaves us 27 categories and 62,652 blogs. Note that the BlogCatalog dataset is unbalanced. For both datasets, we remove stop words, and no further text normalization is done. The statistics of the two datasets are summarized in Table 1.

¹Categories are defined according to venue by Arnetminer
²https://www.blogcatalog.com/

Table 1: Statistics of the Datasets
Dataset           DBLP      BlogCatalog
# of documents    15,300    62,652
# of links        36,359    378,161
# of classes      6         27

6.1.2 Evaluation Metrics

Our goal is to learn vector representations of documents for classification. Therefore, we use classification performance to assess the quality of the learned document representations. In fact, the classification task is also a common way to evaluate these word and document embedding algorithms with unsupervised settings [13]. Two widely used classification evaluation metrics, Micro-F1 and Macro-F1, are adopted. The larger the Micro-F1 and Macro-F1 scores are, the better the document representation is for the classification task.
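As a point of reference, Micro-F1 and Macro-F1 as used here can be computed with scikit-learn; this is a convenience we add — the paper does not state which toolkit was used for the metrics.

```python
from sklearn.metrics import f1_score

y_true = [0, 2, 1, 2, 0]   # gold class ids of the test documents (toy values)
y_pred = [0, 2, 2, 2, 0]   # predictions from the classifier
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
```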

6.2 Performance Comparison

To answer the first question, we compare the proposed framework LDE with other classical and state-of-the-art document representation learning algorithms. Since LDE utilizes contents, links and labels during the learning process, for a fair comparison the compared algorithms include state-of-the-art algorithms that utilize links and contents, such as RTM and TADW, algorithms that utilize contents and labels, such as PTE and CNN, and also a graph-based classifier, GC, which utilizes contents, links and labels for classification. The details of these algorithms are listed as follows:

• BOW [21]: the classical "bag-of-words" represents each document as an M-dimensional vector, where M is the size of the vocabulary and the weight of each dimension is calculated by the TFIDF scheme.

• RTM [7]: the relational topic model is an extension of topic modeling that models document content and links between documents.

• Skip-gram [18]: one of the state-of-the-art word embedding models; its training objective is to find word representations that are useful for predicting the surrounding words of a selected word in a sentence. After obtaining word embeddings by Skip-gram, we use Eq.(21) to get document representations.

• CBOW [18]: another state-of-the-art word embedding model. Unlike Skip-gram, the training objective of CBOW is to find word representations that are useful for predicting the center word from its neighbors. Similarly, we use Eq.(21) to get document representations.

• PV-DM [13]: the distributed memory version of paragraph vectors, which considers the order of the words. It aims at learning document embeddings that are good at predicting the next word given the context.

• PV-DBOW [13]: the distributed bag-of-words version of the paragraph vector model proposed in [13]. Unlike PV-DM, the word order is ignored in PV-DBOW. It aims to learn document representations that are good at predicting words in the document.

• LP [34]: a traditional semi-supervised algorithm based on label propagation, which performs classification by propagating label information from labeled data to unlabeled data through the graph. LP denotes a traditional method that utilizes both network information and label information for classification.

• GC [4]: a graph-based classification method which incorporates document contents, link and label information into a probabilistic framework for classification.

• CNN [11]: a convolutional neural network for classification. It uses word embeddings as input to train a convolutional neural network with label information³.

• TADW [32]: text-associated DeepWalk, a matrix factorization based method that utilizes both link and document data⁴.

• PTE [23]: predictive text embedding, which considers label information to learn word embeddings but cannot handle link information among documents.

• LDE-Word: the proposed framework trains both word embeddings and document embeddings. This variant uses the average of the word embeddings to represent a document.

• LDE-Doc: the proposed framework. Instead of using the word embeddings, we use the document embeddings directly as the representations of the documents.

The experiment consists of two phases, i.e., the representation learning phase and the document classification phase. During the representation learning phase, all the documents in the training set and testing set are used to learn word embeddings or document embeddings. For LDE-Word and LDE-Doc, labels of the training data and link information are also used for learning embeddings during the representation learning phase. Labels of the testing data are held out and no algorithm can use labels of the testing data during the representation learning phase. During the classification phase, we use libsvm⁵ [6] to train an SVM classifier using the learned document embeddings and the training data. The trained SVM classifier is then assessed on the testing data⁶.

There are some parameters to set for the baseline algorithms. For a fair comparison, for Skip-gram, CBOW, PV-DM, PV-DBOW, CNN, RTM and LDE, we set the embedding dimension to be 100. For Skip-gram, CBOW, PV-DM, PV-DBOW and LDE, following the parameter setting suggestions in [18], we set the window size to be 7 and the number of negative samples also to be 7. We follow the setting in [23] for PTE and we use the default setting in the code of TADW. For the proposed model, we choose γ to be 0.0001. As for CNN, we use the default architecture in [11]. For both datasets, we randomly select 60% as training data and the remaining 40% as testing data. The random selection is conducted 5 times and the average Micro-F1 and Macro-F1 with standard deviations are reported in Table 2.

³Code available at https://github.com/yoonkim/CNN_sentence
⁴We use the code from https://github.com/albertyang33/TADW
⁵Available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/
⁶For Skip-gram, CBOW, PV-DM and PV-DBOW, we use the implementation by Gensim, which is available at https://radimrehurek.com/gensim/
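The classification phase described above can be reproduced with any linear SVM. The sketch below (ours) uses scikit-learn's LinearSVC as a stand-in for the libsvm command-line tool referenced in the paper; column-major document embeddings are assumed, as in the earlier sketches.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def classify_with_embeddings(doc_emb, idx_train, y_train, idx_test, y_test):
    """Train an SVM on learned document embeddings and report Micro-/Macro-F1."""
    X_train = doc_emb[:, idx_train].T     # documents as rows
    X_test = doc_emb[:, idx_test].T
    clf = LinearSVC().fit(X_train, y_train)
    pred = clf.predict(X_test)
    return (f1_score(y_test, pred, average="micro"),
            f1_score(y_test, pred, average="macro"))
```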
Table 2: Document Classification Performance Comparison on DBLP and BlogCatalog
              DBLP                        BlogCatalog
Name          Micro-F1      Macro-F1      Micro-F1      Macro-F1
BOW           78.50±0.64    78.61±0.63    46.35±0.42    40.78±0.43
RTM           74.05±0.68    74.08±0.71    44.62±0.35    39.60±0.37
Skip-gram     81.00±0.40    80.98±0.41    47.38±0.28    41.97±0.25
CBOW          77.33±0.73    77.31±0.73    45.43±0.44    39.03±0.29
PV-DM         84.25±0.26    84.25±0.26    48.35±0.24    42.78±0.23
PV-DBOW       80.81±0.30    80.82±0.29    47.56±0.23    41.68±0.25
LP            72.88±0.75    72.90±0.76    38.54±0.42    35.51±0.40
GC            84.75±0.82    84.74±0.81    48.76±0.37    42.98±0.34
TADW          85.59±0.65    85.58±0.64    49.85±0.31    43.95±0.32
CNN           84.07±0.45    84.09±0.48    49.01±0.51    43.38±0.47
PTE           85.26±0.47    85.23±0.49    50.36±0.43    44.58±0.42
LDE-Word      80.87±0.36    80.83±0.39    48.77±0.29    42.96±0.25
LDE-Doc       87.69±0.42    87.70±0.45    53.14±0.42    46.85±0.39

From the table, we make the following observations:

• Skip-gram and PV-DM outperform BOW slightly on both datasets. This shows that word/document embeddings can learn dense representations of documents which can improve the classification performance slightly. In contrast, LDE-Doc is much better than BOW on both datasets, which demonstrates the effectiveness of the proposed framework by incorporating link and label information.

• Most of the time, CNN outperforms the other baseline methods. CNN uses label information for training and is thus likely to obtain better performance. PTE outperforms CNN, which is consistent with previous observations [23].

• Comparing LDE-Doc and TADW, we can see that the performance of LDE-Doc is better than TADW. This is because, though TADW utilizes link information, it does not consider label information for learning document representations.

• The performance of LDE-Doc is much better than that of LDE-Word. This is because LDE focuses on learning document representations: the link information and label information are used by LDE specifically for document embedding instead of word embedding. LDE-Word is comparable to Skip-gram.

• The performance of LDE-Doc is better than the graph-based classification method GC, which also utilizes contents, link and label information for classification. This suggests that, by utilizing the distributional hypothesis idea and exploiting the word-word-document, document-link and document-label relationships, the learned document representations are good for classification.

• Though both PTE and LDE-Doc follow the idea of the distributional hypothesis and use the label information, LDE-Doc significantly outperforms PTE. This is because, in addition to label information, LDE-Doc also models the link information among documents, which is pervasively available for linked documents.

• The proposed document embedding algorithm LDE-Doc outperforms representative document representation algorithms including PV-DM, RTM, PV-DBOW, PTE and TADW, as well as graph-based classification methods such as GC, which further demonstrates that, by considering link and label information, the proposed framework LDE is able to learn better document representations.

We conduct a t-test on all performance comparisons, and it is evident from the t-test that all improvements are significant. In summary, the proposed framework can learn better document embeddings for classification by exploiting link and label information.

6.3 Impact of Label Density

In this subsection, we perform experiments to investigate the effects of the label density on the quality of word and document embeddings. We first split each dataset into 60% and 40%, where the 40% is fixed as testing data. From the 60% part, we sample x% as labeled data in the embedding learning phase. The remaining (100−x)% are not used for learning the embeddings. We vary x as {0, 20, 40, 60, 80, 100}. Similarly, the experiment is composed of two phases. During the representation learning phase, we use all the documents with the x% labeled data and link information to learn the embeddings. After the embeddings are learned, we use libsvm to train an SVM classifier. Each experiment is conducted 5 times and we report the average classification performance in terms of Micro-F1 and Macro-F1 in Table 3. To help understand the effects, we plot the relative performance improvement compared to that without label information (x = 0) in terms of Micro-F1 in Figure 2. Note that we omit the figure for Macro-F1 since we have similar observations. From the table and the figures, we make the following observations:

Table 3: Effects of Label Density for LDE.
Dataset       Algorithm   Metrics    0%      20%     40%     60%     80%     100%
DBLP          LDE-Word    Micro-F1   76.62   77.30   78.25   78.79   79.69   80.87
                          Macro-F1   76.63   77.26   78.23   78.76   79.65   80.83
              LDE-Doc     Micro-F1   78.68   79.05   81.85   83.87   85.65   87.69
                          Macro-F1   78.68   79.04   81.85   83.87   85.65   87.70
BlogCatalog   LDE-Word    Micro-F1   45.45   45.92   46.36   47.05   47.98   48.77
                          Macro-F1   39.62   40.09   40.73   41.47   42.20   42.96
              LDE-Doc     Micro-F1   46.57   47.12   48.83   50.17   51.62   53.14
                          Macro-F1   40.65   41.24   42.76   44.15   45.51   46.85

Figure 2: Relative performance improvement of LDE with increasing label information. (a) Micro-F1 on DBLP; (b) Micro-F1 on BlogCatalog.

• For both datasets, with the increase of labeled data in the representation learning phase, the performance of both word embedding and document embedding increases, which demonstrates that by incorporating label information, we can learn better document and word embeddings.

• From the figure, we can see that with the increase of labeled data, the difference between LDE-Word and LDE-Doc also increases, which indicates that label information is more useful for the proposed framework to learn document embeddings than word embeddings. This is reasonable because we explicitly model document and label relations to enable the capability of the learned representations in predicting labels.
6.4 Effects of Link Density

In this subsection, we perform experiments to investigate the effects of the link density on the quality of word and document embeddings. Each time, we randomly sample x% of the links. We vary x as {0, 20, 40, 60, 80, 100}. We then split the dataset into 60% and 40%, where 60% is used for training and 40% is used for testing. The experiment is composed of two phases. During the representation learning phase, we use all the documents, the labels of the training data and the x% of links to learn embeddings. After learning the embeddings, we use libsvm to train an SVM classifier. We report the classification performance in terms of Micro-F1 and Macro-F1 in Table 4. Similarly, to help understand the effects, we plot the relative performance improvement in terms of Micro-F1 compared to that without links (x = 0) in Figure 3. From the table and the figures, we make the following observations:

Table 4: Effects of Link Density on LDE.
Dataset       Algorithm   Metrics    0%      20%     40%     60%     80%     100%
DBLP          LDE-Word    Micro-F1   76.83   79.15   79.25   79.51   79.35   80.87
                          Macro-F1   76.82   79.12   79.23   79.48   79.32   80.83
              LDE-Doc     Micro-F1   78.03   81.31   82.78   83.56   85.20   87.69
                          Macro-F1   78.05   81.30   82.79   83.57   85.22   87.70
BlogCatalog   LDE-Word    Micro-F1   45.87   46.48   46.84   47.07   47.75   48.77
                          Macro-F1   39.87   40.38   40.97   41.52   42.13   42.96
              LDE-Doc     Micro-F1   46.89   48.04   49.23   50.15   51.65   53.14
                          Macro-F1   40.91   41.62   42.65   44.27   45.71   46.85

Figure 3: Relative performance improvement of LDE with increasing link information. (a) Micro-F1 on DBLP; (b) Micro-F1 on BlogCatalog.

• For both datasets, as the percentage of links increases during the representation learning phase, the performance of both word embedding and document embedding increases, which demonstrates that by incorporating link information, we can learn better document and word embeddings.

• From the figure, we can see that as the percentage of links increases, the difference between LDE-Word and LDE-Doc also increases, which suggests that link information helps document embedding more than word embedding. The reason is that we extract document-document relations from link information and then explicitly model them based on document embeddings.

6.5 Effects of Embedding Dimensions

In this subsection, we investigate how the embedding dimension affects the performance of the proposed framework LDE. In detail, we first randomly select 60% as training data and the remaining 40% as testing data. All the documents, link information and label information of the 60% training data are used for learning document and word embeddings. After that, we train an SVM classifier to perform document classification with the learned document and word embeddings on the testing data. We vary the embedding dimension d as {20, 50, 100, 200, 400, 1000}. The random selection process is done 5 times and the average Micro-F1 scores are shown in Figure 4. Note that we only report Micro-F1 since the performance in terms of Macro-F1 is very close to Micro-F1. From the figures, we note that as the dimension of the embeddings increases, the performance of both document embedding and word embedding first increases and then decreases. This is because when the embedding dimension is too small, the representation capability of the embedding vectors is not sufficient and we may lose information; however, when the embedding dimension is too large, the model is too complex and we may overfit the data.

Figure 4: The effects of embedding dimension on the classification performance of document and word embedding. (a) Micro-F1 on DBLP; (b) Micro-F1 on BlogCatalog.

7. CONCLUSION

In many real-world applications, documents are linked via various means. Little work exists on exploiting link information for document embedding. In this paper, we investigate the problem of linked document embedding for classification. We propose a novel framework LDE that captures content, link and label information in a coherent model for learning document and word embeddings simultaneously. Experimental results on real-world datasets show that the proposed framework outperforms state-of-the-art document representation algorithms for classification. Further experiments are conducted to demonstrate the effects of label density and link density on the performance of LDE, which suggest that both link and label information can help learn better word and document embeddings for classification.

There are several interesting directions that need further investigation. First, in this work we consider linked document embedding for classification, and it will be interesting to investigate linked document embedding specific to other tasks such as ranking or recommendation [30]. Second, currently we deal with unsigned and unweighted links, and we would like to investigate how to extend the proposed framework to handle other types of links such as signed and weighted links [25].

8. ACKNOWLEDGMENTS

This work is sponsored, in part, by Office of Naval Research (ONR) grant N00014-16-1-2257 and National Science Foundation (NSF) under grant number IIS-1217466.

9. REFERENCES

[1] Jacob Abernethy, Olivier Chapelle, and Carlos Castillo. Graph regularization methods for web spam detection. Machine Learning, 81(2):207–225, 2010.
[2] Ralitsa Angelova, Gjergji Kasneci, and Gerhard Weikum. Graffiti: graph-based classification in heterogeneous networks. World Wide Web, 15(2):139–170, 2012.
[3] Ralitsa Angelova and Stefan Siersdorfer. A neighborhood-based approach for clustering of linked document collections. In CIKM. ACM, 2006.
[4] Ralitsa Angelova and Gerhard Weikum. Graph-based text classification: learn from your neighbors. In SIGIR, pages 485–492. ACM, 2006.
[5] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[6] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM TIST, 2(3):27, 2011.
[7] Jonathan Chang and David M Blei. Relational topic models for document networks. In AISTATS, 2009.
[8] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. Heterogeneous network embedding via deep architectures. In KDD. ACM, 2015.
[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[10] John R Firth. A synopsis of linguistic theory 1930–55 (special volume of the Philological Society), 1957.
[11] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
[12] Thomas K Landauer, Peter W Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284, 1998.
[13] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.
[14] Bo Long, Zhongfei Mark Zhang, and Philip S Yu. A probabilistic framework for relational clustering. In KDD. ACM, 2007.
[15] Hao Ma, Michael R Lyu, and Irwin King. Learning to recommend with trust and distrust relationships. In RecSys, pages 189–196. ACM, 2009.
[16] Paolo Massa and Paolo Avesani. Trust-aware recommender systems. In RecSys. ACM, 2007.
[17] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[19] Jennifer Neville, Micah Adler, and David Jensen. Clustering relational data using attribute and link information. In Proceedings of the Text Mining and Link Analysis Workshop, 18th International Joint Conference on Artificial Intelligence, pages 9–15, 2003.
[20] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. Deep sentence embedding using the long short term memory network: Analysis and application to information retrieval. arXiv preprint arXiv:1502.06922, 2015.
[21] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[22] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
[23] Jian Tang, Meng Qu, and Qiaozhu Mei. Pte: Predictive text embedding through large-scale heterogeneous text networks. In KDD. ACM, 2015.
[24] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Arnetminer: extraction and mining of academic social networks. In KDD, pages 990–998. ACM, 2008.
[25] Jiliang Tang, Yi Chang, Charu Aggarwal, and Huan Liu. A survey of signed network mining in social media. ACM Computing Surveys, 2016.
[26] Jiliang Tang and Huan Liu. Feature selection with linked data in social media. In SDM. SIAM, 2012.
[27] Jiliang Tang and Huan Liu. Unsupervised feature selection for linked social media data. In KDD, 2012.
[28] Mihaela Vela and Liling Tan. Predicting machine translation adequacy with document embeddings. In EMNLP, page 402, 2015.
[29] Suhang Wang, Jiliang Tang, Fred Morstatter, and Huan Liu. Paired restricted boltzmann machine for linked data. In CIKM. ACM, 2016.
[30] Suhang Wang, Jiliang Tang, Yilin Wang, and Huan Liu. Exploring implicit hierarchical structures for recommender systems. In IJCAI, 2015.
[31] Xing Wei and W Bruce Croft. Lda-based document models for ad-hoc retrieval. In SIGIR. ACM, 2006.
[32] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network representation learning with rich text information. In IJCAI, 2015.
[33] Shenghuo Zhu, Kai Yu, Yun Chi, and Yihong Gong. Combining content and link for classification using matrix factorization. In SIGIR. ACM, 2007.
[34] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.
[35] Yaojia Zhu, Xiaoran Yan, Lise Getoor, and Cristopher Moore. Scalable text and link analysis with mixed-topic link models. In KDD, 2013.