
International Journal of Grid and Distributed Computing Vol. 11, No. 9 (2018), pp.65-80 http://dx.doi.org/10.14257/ijgdc.2018.11.9.06

Semantic Representations in Text Data

Triveni Lal Pal*, Madhu Kumari, Tajinder Singh and Mohammad Ahsan Department of Computer Science and Engineering National Institute of Technology Hamirpur, Hamirpur (H.P.), India [email protected], [email protected], [email protected], [email protected]

Abstract

Automatic text processing and other sophisticated natural language processing tasks need realistic representations of text/documents that can be embedded efficiently. All such representations rest on the notion that every piece of data reflects different explanatory factors (attributes). In this article, we exploit these explanatory factors to study and compare various semantic representation methods for text documents. The article critically reviews recent trends in semi-supervised semantic representation, covering cutting-edge methods in distributed representations such as embeddings. It gives a broad, synthesized description of the various forms of text representation, presented in chronological order, ranging from bag-of-words (BoW) models to the most recent embedding learning. Taken together, the findings provide valuable pointers for researchers looking to work in the field of semantic representations. In addition, the article shows that one needs to develop a model for learning universal embeddings in unsupervised/semi-supervised settings that incorporates contextual as well as word-order information, uses language-independent features, and remains feasible for large datasets.

Keywords: Word Embeddings, Model, Long-Short-Term Memory, Recurrent Neural Networks

1. Introduction

Broadly, there are two approaches to representing text/documents. One is the syntax-based approach, grounded in linguistic theory, grammar and logic [1]. The other is the semantic approach, based on statistical techniques and learning from past data. These representations vary in how much of the different explanatory attributes behind the data they expose or hide. For example, a one-hot vector with binary entries only indicates the presence or absence of a word in a particular document and ignores frequency information, while the vector space model is relatively more informative and reflects how frequent a particular word is in a given document; the actual frequency of occurrence of a term does give better performance with various predictive methods in machine learning. Some NLP applications, such as grammar checking, need support from the syntax of the input language. However, for more ambitious text mining goals such as abstractive summarization and text categorization, an essential requirement is a theory of semantic representation that uses background (contextual) knowledge, and of how that knowledge relates to and interacts with the representation. This paper systematically surveys the issues that researchers in the latter category (i.e., the semantic approach) grapple with.

Received (January 18, 2018), Review Result (May 14, 2018), Accepted (May 25, 2018) * Corresponding Author


Even though logicist approaches and cognitive-science approaches perform well for semantic representation, they use more linguistic features and need more human intervention. To automate text processing, we need to look at statistical approaches. In this paper, therefore, we place more emphasis on statistical approaches to semantic representation (statistical semantics). In conventional approaches such as the Vector Space Model (VSM) [2], text has been represented as a bag of words (BoW). VSM, in its most basic form, uses Boolean entries for each element in the vector to indicate the presence or absence of a word in the document. Further, term frequency (tf), term frequency-inverse document frequency (tf-idf), point-wise positive mutual information (PPMI), etc. were used as weighting factors to capture the notion that not all words are equally important in a document. The vector space model considers numerical feature vectors in a Euclidean space. Each word in VSM is treated as independent of the others, so the semantic relations between words are lost. BoW also ignores word order, thus missing important semantic relations between the words. Researchers in the text mining community have proposed ingenious solutions to incorporate semantic relations (word order) into the vector space model. The n-gram statistical language model [3] is one such attempt. The n-gram model is intended to incorporate semantics by using context words to predict the target word: the target word is predicted using the conditional probability P(w_n | w_1, ..., w_{n-1}), where w_n is the target word and the preceding words w_1, ..., w_{n-1} are called the context. Though the n-gram model has been widely used, it suffers from the curse of dimensionality.

Alternatively, ontologies deal with semantic heterogeneity in structured data and, as conceptual models, provide the necessary framework for the semantic representation of textual information. Ontology representation is based on the concept as the principal link between texts. An ontology links concept labels to their interpretations, i.e., specifications of their meanings, including concept definitions and relations to other concepts [4]. Ontologies find applications in text summarization, document clustering, text annotation, biomedicine, attitude analysis, entity and event annotation, etc. Ontologies need hand-coded knowledge to capture semantics, which makes them labor intensive, so they are excluded from this study. Graphs, or semantic graphs, have proven to be more prominent than ontologies [5]. In a semantic graph or semantic network representation, knowledge is represented by nodes and the connections between nodes. Nodes represent things, concepts, and states, whereas labeled edges represent relationships among nodes. The semantic graph is based on the assumption that the higher the degree of connection between words or concepts, the more similar the words are in meaning. A graphical representation of a text document is powerful because it supports most operations on text, such as topological, relational and statistical analyses. Though a semantic network is intuitive, natural, and easy to understand, the absence of quantifiers makes it difficult to describe complex relationships. Another approach, which has gained the most attention of all semantic space models and is known for its ability to incorporate hidden semantic relations between words/documents, is Latent Semantic Analysis (LSA) [6]. Though LSA also uses the BoW approach, it proved to be better than the basic VSM because of its unique dimensionality reduction algorithm.
LSA first forms a term-document matrix from a document collection and then finds its low-rank approximation using singular value decomposition (SVD). LSA has the capability of finding hidden semantic relations (hence the name "latent") even if two words never co-occur in a document. The basic idea behind LSA's induction of a word's meaning is that the aggregate of contexts in which a word does or does not occur produces a set of constraints that determines the meaning of the word. Firth (1957) put this idea as, "you shall know a word by the company it keeps." LSA has a wide range of applications in psycholinguistic phenomena, including judgments of essay quality and discourse comprehension.


Despite its success, LSA has been criticized for its BoW approach, which ignores statistical information regarding word order. The semantics of a word is not defined only by its context; the sequence of word occurrences also contributes to the overall meaning of the word. In recent years, low-dimensional, dense vectors called "word embeddings", learned with neural networks, have been gaining attention for semantic learning. These methods are quite successful in learning the semantic representation of words. The skip-gram model and the continuous bag-of-words (CBOW) model [7] [8] are popular machine learning approaches for learning word representations that can be efficiently trained on large amounts of text data. The training objective of CBOW is to combine the representations of the surrounding (context) words to predict the word in the middle (the target word), whereas the skip-gram model is trained to predict the source context words based on the target word. Recently, [9] used word embeddings in another approach, known as the "semantic word cloud", which uses spatial distance to show word relatedness; this approach better visualizes the semantic relatedness between words by improving the aesthetics of the word layout. Tensors [10], alternatives to vectors, are multidimensional objects that have been used for information retrieval, document analysis, text categorization, etc. Tensors have proved to be more prominent than vectors because dimensionality reduction in VSM is limited by the number of samples, whereas the tensor space model (TSM) has no such limit: HOSVD under TSM can reduce the data to any dimension.

The paper is organized in the following sections. The next section (Section 2) builds the basis for the study by looking back at the brief history of text representation. Various representation approaches, with more emphasis on distributed representations, are presented in Section 3. This section will be very helpful to informed users, as it is a comprehensive description of task-specific representations together with the state-of-the-art methodologies for those tasks. The comparisons showing where and why one model lags behind another give researchers not only conceptual clarity but also equip them to select the appropriate method for their research. Subsection 3.2 is more imperative: it surveys recent trends in the area of semi-supervised semantic representation (i.e., word embeddings). The section presents various neural network based methods, such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) and their variants, in chronological order. Section 4 of the article lists some of the tools/software packages that are publicly available as open source or as proprietary software for semantic representation learning, to help beginners in the field. The last section (Section 5) concludes by discussing features of different methodologies, open issues and directions for further research.

2. Early History

Semantic representation of text has always been a highly desirable and challenging task for researchers in the linguistics, psychology, NLP and text mining communities. Back in the 1950s, Charles Osgood and George Miller worked on semantic representations. Though they worked independently on seemingly unrelated problems, Osgood (1952) tackled the problem of representing word meaning and Miller (1951) addressed the problem of transitions between words through sequential dependencies. In other attempts in the 1960s, psycholinguists tried to create a representation of the information contained in text data. In the process of providing an explanatory theory for the nature of the representation, they proposed the hypothesis that "the mental encoding of a sentence corresponds to one of its linguistically motivated representations, its syntactic deep structure or its semantic representation". Though this hypothesis was quickly proved incorrect, it laid the foundation for models which showed that text representation is influenced by world knowledge (context), which contributes a great deal to semantic knowledge.


3. Semantic Representation Approaches

The literature on semantic representations of text documents is very rich. It includes a number of approaches, varying from domain-specific semantic representations to more generic ones. Though categorizing these methods is difficult, we have broadly grouped them into the distributional approach (based on the distribution of words in or across documents) and the distributed approach (a learning perspective). To constrain the study, we cover unsupervised and/or semi-supervised distributed (embedding) models in detail and only briefly introduce the other models. The next subsection (Section 3.1) is dedicated to methods under the distributional approach to semantic representation. Word embeddings, which have recently been at the center of domain-specific semantic representations, are described in Subsection 3.2.

3.1. Distributional Approaches

The edifice of the distributional approach to meaning is built on the distributional hypothesis, which states that words that appear in similar contexts have similar semantics. Researchers have also given the name "vector semantics" [11] to the distributional approach, as it is generally based on a VSM or co-occurrence matrix. From the distributional perspective, there is a plethora of methods such as VSM, TSM, phrase-based approaches, LSA, topic models, graph-based approaches (semantic graphs), ontology-based approaches, etc. Subsection 3.1.1 discusses the major works in VSM. Ontologies and semantic graphs are excluded from this study, with only a few introductory points given above.

3.1.1. Vector Space Model: Text data, in its most basic representation model, can be represented as a term-by-document matrix, where each row of the matrix is a document and each column represents a feature (word). The collective set of features is typically called a dictionary. Each element in the matrix then represents the presence or absence of the word in the corresponding document, yielding a matrix with binary entries. Though this model is very simple to understand and requires little computation time to prepare the data, it uses all the words (including weak or similar words) in the text data. As a consequence, a high-dimensional sparse matrix is obtained, and in many applications it is preferable to work with reduced sparseness. One obvious reduction is stop-word removal, because these words hardly contribute anything towards meaning. Frequency is another important piece of information that can be used in reducing the dimension of the matrix: the most frequent words (other than stop words) are often the most important in terms of meaning representation. The vector space model uses term frequencies (tf) in the document vectors. Raw co-occurrence frequency is simple to understand; however, it is skewed and not very discriminative, so it is not the best measure of association between words. García-Hernández et al. [12] use a tf-idf representation in extractive text summarization, with weighted term frequency and inverse sentence frequency (Table 1).


Table 1. Weighting Factors used in VSM

Weighting factor   Formula
Boolean            0 or 1
tf                 tf_{ij} = f_{ij} / \max_i \{f_{ij}\}
idf                idf_i = \log(N / df_i)
tf-idf             w_{ij} = (f_{ij} / \max_i \{f_{ij}\}) \cdot \log(N / df_i)
PPMI               PPMI(w, c) = \max\left(\log \frac{p(w, c)}{p(w)\, p(c)}, 0\right)

The tf-idf weighting scheme is able to determine the relevance of a document by calculating the numerical (tf-idf) value of all the words in the document. Being a frequency-based method, it is unable to identify semantics and cannot deal with synonymy and polysemy. Synonymy tends to decrease the similarity value, whereas polysemy tends to increase it, leading to poor recall and poor precision respectively. High dimensionality is another problem, requiring large memory space and computational time. Despite the fact that VSM is not well suited to semantic representation, with some modification [13] used a concept vector representation model for representing the semantics of text to compute word relatedness using Temporal Semantic Analysis (TSA). In that model, instead of representing a word with a vector of unit concepts, the authors manipulated vectors of time series, where each time series describes concept dynamics over time. The model is based on the hypothesis that concepts with similar behavior over time are semantically related. The authors further revealed that this representation with an extra temporal dimension can help in extracting the semantic relation between words. Another similar work by [14] used Wikipedia-based Explicit Semantic Analysis (ESA) to compute semantic relatedness and obtain a more meaningful semantic interpretation of a text fragment. Since VSM can represent an unformatted text document in a simple and formulaic notation, various algorithms used in data mining can be applied without any modification; because of this advantage, much research on VSM is being actively carried out. Though VSM (in its basic form) is easy to understand and effectively models a text document, it fails to capture the semantic relations between the terms in the document.
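As a concrete illustration of the weighting schemes above, the short sketch below builds Boolean, raw-count and tf-idf term-document matrices for a toy corpus. It is only a minimal example assuming a recent version of scikit-learn; the sentences are made up for illustration, and scikit-learn's tf-idf is a smoothed, normalized variant of the formula in Table 1.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical sentences used only for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Boolean (presence/absence) term-document matrix.
boolean_tdm = CountVectorizer(binary=True).fit_transform(docs)

# Raw term-frequency matrix.
count_vec = CountVectorizer()
tf_tdm = count_vec.fit_transform(docs)

# tf-idf weighted matrix.
tfidf_tdm = TfidfVectorizer().fit_transform(docs)

print(count_vec.get_feature_names_out())   # the dictionary (feature words)
print(tf_tdm.toarray())                    # rows = documents, columns = words
print(tfidf_tdm.toarray().round(2))
```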

3.1.2. Latent Semantic Analysis: LSA [6] is a semantic space model that addresses the problem with tf-idf of not capturing semantic relations between words. In the process of finding hidden semantics, LSA extracts and represents the contextual-usage meaning of words by applying statistical methods to large text corpora. Based on word usage, LSA compares two text pieces/documents by calculating the semantic similarity between them. LSA is based on the assumption that words in documents have some latent semantic structure, that this structure is the same for all synonyms of a word, and that a polysemous word has a variety of different semantic structures. LSA first forms a document-term matrix from a document collection and then finds its low-rank approximation using Singular Value Decomposition (SVD). SVD decomposes and maps high-dimensional document and query vectors to a lower-dimensional (say, k-dimensional) semantic space. Semantic similarity between two documents can then be determined by the distance between their projected vectors in the concept or semantic space.


For a given word×context (or word×document) matrix A with T > I, SVD factorizes A into U, Σ and V such that [10]:

T A U V (1)

where the matrix U = [u_1, u_2, ..., u_I] \in R^{I \times I} contains the I left singular vectors, V = [v_1, v_2, ..., v_T] \in R^{T \times T} contains the T right singular vectors, and \Sigma \in R^{I \times T} is diagonal with nonnegative elements on the main diagonal. LSA improves the accuracy of text representation by retaining the semantic relations between words and mitigating the influence of synonymy and polysemy. Also, the rank reduction performed by SVD is able to map high-dimensional document vectors in VSM to a low-dimensional latent semantic space, reducing the dimensionality problem for large corpora. LSA has some limitations too. First, LSA was basically designed for semantic analysis purposes, so its performance on other downstream tasks is not up to the mark. Second, LSA uses SVD for dimensionality reduction, which is a purely mathematical treatment of the matrix whose physical interpretation is not clearly defined; moreover, controlling the SVD process on large-scale data is difficult due to its high time and space complexity. Third, LSA assumes a Gaussian distribution of words over documents, whereas the reported distribution is Poisson. An alternative to LSA, Probabilistic Latent Semantic Analysis (PLSA) [15], has been reported to give better results than LSA.
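To make the LSA pipeline concrete, the sketch below builds a tf-idf term-document matrix, applies a truncated SVD to obtain a k-dimensional latent semantic space, and compares documents by cosine similarity in that space. It is a minimal illustration assuming scikit-learn; the corpus and the choice of k are arbitrary stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus; in practice this would be a large document collection.
docs = [
    "the judge ruled on the court case",
    "the court heard the case of the judge",
    "the striker scored a goal in the match",
    "the match ended after the final goal",
]

# Term-document matrix (documents as rows here, as scikit-learn expects).
X = TfidfVectorizer().fit_transform(docs)

# Low-rank approximation: project documents into a k-dimensional semantic space.
k = 2
lsa = TruncatedSVD(n_components=k, random_state=0)
doc_vectors = lsa.fit_transform(X)

# Semantic similarity between documents = cosine similarity of projected vectors.
print(cosine_similarity(doc_vectors[:1], doc_vectors[1:]).round(2))
```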

3.1.3. Probabilistic Latent Semantic Analysis (PLSA): Probabilistic Latent Semantic Analysis (PLSA) [15], inspired and influenced by Latent Semantic Analysis (LSA) [6], is used for learning a semantic representation of a word in its context of use without using any thesaurus. Thus, it is helpful in disambiguating polysemous words. It groups together words with a common context to reveal topical similarity. Unlike the matrix transformation in LSA, PLSA uses a generative model: it tries to approximate the exact solution using an approximate calculation method. PLSA uses the iterative Expectation Maximization (EM) algorithm for constructing the mapping, which makes it more efficient. All probabilistic topic models work on the assumption that a document is a mixed distribution of topics and each topic is a probability distribution over words. In PLSA, a topic f_k is assumed to be a distribution over a fixed-size vocabulary V; thus, f_k is essentially a vector in which each element f(k, w) represents the probability that term w is chosen by topic k, namely:

p(w \mid k) = f(k, w)    (2)

PLSA effectively tackles the problems of synonyms and polysemes by using the EM algorithm to train the latent classes. The EM algorithm reduces time complexity, thus raising computing speed [16], and performs better than SVD. The disadvantage of PLSA is that it suffers from overfitting. Also, due to its unsupervised nature, the convergence of the EM iterations in PLSA is very slow.
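The following is a minimal NumPy sketch of the tabular EM procedure for PLSA under the assumptions above (a document-word count matrix X and K latent topics); the parameter names and the toy data are illustrative and not taken from the cited implementation.

```python
import numpy as np

def plsa(X, K, iters=50, seed=0):
    """Fit PLSA to a document-word count matrix X (D x W) with K topics via EM."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]                 # (D, W, K)
        resp = joint / np.clip(joint.sum(axis=2, keepdims=True), 1e-12, None)
        # M-step: re-estimate parameters from expected counts.
        expected = X[:, :, None] * resp                                  # (D, W, K)
        p_w_z = expected.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z

# Toy count matrix: 6 documents over a 10-word vocabulary.
X = np.random.default_rng(1).integers(0, 5, size=(6, 10)).astype(float)
p_topic_given_doc, p_word_given_topic = plsa(X, K=2)
print(p_topic_given_doc.round(2))
```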

3.2. Distributed Approaches

Distributional approaches work on word counts over co-occurrence windows, based on the distributional hypothesis. These count-based models are simple to train, but rare n-grams can be poorly estimated. Moreover, most of these methods build the semantic vector of a word in isolation and fail to incorporate complex semantic relations between words. Another class of semantic representations, which works on word prediction, is often called distributed representations due to the distributed nature of their features in vector space. This section of the paper covers methods under distributed representations.


Word embeddings have been successfully implemented in language models, extractive summarization, machine translation, named entity recognition, disambiguation, parsing, semantic word cloud and social media mining [17].

3.2.1. Character Embeddings: Rather than using vectors corresponding to words, many researchers have proposed character-level vector representations. Mikolov et al. [18] used characters, syllables and frequent words in a subword language model; this model is useful in handling out-of-vocabulary words. Another model, character-to-word (C2W), proposed by Ling et al. [19], is used for word representations: words are composed from vectors of characters, which is helpful in language modeling and POS tagging. The Character-Aware Neural Language Model [20] is another approach that employs a convolutional neural network (CNN) and a highway network over characters together with an LSTM neural network language model. The authors analytically showed that their character model is able to encode both semantic and orthographic information.

3.2.2. Word Embeddings (Word2vec): Word embeddings were popularized by word2vec, a family of energy-based models [7] [8]. Word2vec internally uses the continuous bag-of-words (CBOW) model and the skip-gram model for learning high-quality word vectors that can be efficiently trained on large amounts of text data. In CBOW, the target word is predicted by combining the representations of the surrounding (context) words; alternatively, the skip-gram model is trained to predict the source context words based on the target word.

[Figure 1 shows the two architectures side by side: CBOW sums the projections of the context words w(t-2), w(t-1), w(t+1), w(t+2) to predict w(t), while Skip-gram takes w(t) as input and predicts each of the surrounding context words.]

Figure 1. A simple CBOW and Skip-gram Model

Algorithmically, both of these models are the same; the difference lies in the way they are trained to predict the target words. Figure 1 illustrates the CBOW and skip-gram models. The training objective of the CBOW model is to combine the representations of the surrounding (context) words to predict the word in the middle (the target word), whereas the skip-gram model is trained to predict the source context words based on the target word. These models were implemented in word2vec, a Google project led by Tomas Mikolov, and claimed to be a computationally efficient predictive model for learning word embeddings from raw text. Skip-gram works successfully on word analogies, but poorly utilizes the global co-occurrence statistics as it is trained on separate local context windows. GloVe [21], a specific weighted least squares model that is trained on the global word-word co-occurrence


matrix, can thus be seen as an efficient model that uses global statistics. GloVe can be advantageous over word2vec in that its implementation is easier to parallelize; consequently, this type of model can be trained on more data, which is an advantage over word2vec. The disadvantage of GloVe relative to word2vec is that it is a context-counting model, whereas its counterpart is a context-predicting model. In [22], the authors extended the skip-gram model [7] and proposed topical word embeddings (TWE), trying to incorporate word relations by associating topical information with words. They follow the notion that each word has different embeddings under different topics. Though TWE is successful in capturing topical information and complex word correlations, it suffers from the loss of word-order and contextual information.
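As a quick illustration of training the CBOW and skip-gram models described above, the sketch below uses the Gensim implementation of word2vec (parameter names follow Gensim 4.x); the tiny corpus is only a stand-in for the large corpora these models actually need.

```python
from gensim.models import Word2Vec

# Tokenized toy corpus; real models are trained on millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
]

# sg=0 selects CBOW (predict the middle word from its context),
# sg=1 selects skip-gram (predict the context from the middle word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=200)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

# Query the learned embeddings.
print(skipgram.wv["king"][:5])                  # first few dimensions of one vector
print(skipgram.wv.most_similar("king", topn=3))
```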

3.2.3. Multiple Embeddings per Word: The single-vector representation per word that most methods follow suffers in handling homonymy and polysemy: these models use only local context, and local context is sometimes not sufficient to capture the exact sense of a word, as words are often polysemous. Reference [23] follows a multi-prototype vector space approach and proposed a Global Context-Aware Neural Language Model capable of learning word embeddings that better capture semantics by incorporating both local and global document context. The authors handled homonymy and polysemy by learning multiple embeddings per word, so the score for the next word is computed using local as well as global context. This multi-prototype word embedding model mitigates the problem of homonymy and polysemy by generating multiple embeddings corresponding to a single word. However, the model ignores the complex correlations that most words have, as it generates multi-prototype vectors for each word in isolation. In reality, words may have semantic relations, causing their semantic clusters to overlap; thus there is no clear semantic boundary between words.

3.2.4. Embeddings of Longer Word Sequences: Researchers have come up with embeddings of longer word sequences to incorporate word-sequence information. Consequently, phrase representations [24], sentence embeddings [25] and paragraph vectors [26] were developed that capture the semantics of words (or the distances between words). Paragraph vectors, an unsupervised method, successfully learn fixed-length feature representations from variable-length pieces of text; the paragraph vector is an extension of the work by Mikolov et al. [7] [8] in [26]. Though the model gives better performance on sentiment analysis, it lacks the ability to capture fine-grained sentence structure [25]. The sentence embedding model Skip-Thoughts shows great performance on the BookCorpus. Its authors used an encoder-decoder pair: the encoder encodes the current sentence s(t), and based on s(t) two separate decoders decode the previous (s(t-1)) and next (s(t+1)) sentences. RNNs with Gated Recurrent Units (GRU) are used for the encoder and decoders. Skip-Thought vectors need extensive training on a large corpus of contiguous text. Recently, an autoencoder (the Context-sensitive Autoencoder) [27] was used in representation learning for computing text-pair similarity; in the Context-sensitive Autoencoder, contextual information is fed into deep autoencoders as low-dimensional vectors. Word embeddings suffer from ignoring word-order, contextual and morphological information. Tensors [10], alternatives to vectors, are also gaining attention in semantic representation, for information retrieval [28], document analysis and text categorization. The Convolutional Neural Network (CNN) is another machine learning approach that is highly used in image processing and text mining.

3.2.5. RNN Based Models: The Feedforward Neural Network (FNN) was used by Bengio et al. [29] in their neural language model, trained to predict the next word in a given sequence of words as well as to learn vector representations for each word in the dictionary. However, its structure has the limitation of not having memory to remember previous words.


Hence, the number of context words taken to predict the next word is limited by a fixed number of input neurons, thereby missing contextual information that is necessary for determining the target word. Researchers introduced some form of memory into the FNN architecture by incorporating extra neurons (context neurons) connected to the hidden layer, which was further extended to a theoretically infinite size by introducing the Recurrent Neural Network (RNN). Consequently, an RNN can handle arbitrary context lengths and can efficiently represent more complex patterns than shallow neural networks.

Figure 2. Single RNN Unit

Figure 2 shows a single RNN unit. Unfolding this unit, one can obtain the complete architecture of the RNN model, which consists of input, hidden and output layers. The recurrent matrix in the RNN, which connects the hidden layer to itself using time-delayed connections, contributes the memory component of the RNN [7]. The value of the hidden layer is updated based on the current input and the state of the hidden layer in the previous time step:

hfWWxthtxt(h) 1 (3)

where x_t is the input at time t, h_{t-1} is the hidden state at time t-1 and f is a transformation (activation) function. The input and output of the network are series of vectors, and learning in RNNs is defined by optimizing objective functions. Table 2 lists common choices of activation functions; of these, the most commonly used activation function in feedforward neural nets is tanh. Figure 3 demonstrates the use of an activation function to obtain output activations.
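A minimal NumPy sketch of the recurrence in Eq. (3), scanning a toy input sequence with randomly initialized weights (all sizes and names here are illustrative):

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, h0):
    """Apply Eq. (3) step by step: h_t = tanh(W_hx x_t + W_hh h_{t-1})."""
    h = h0
    states = []
    for x in xs:                      # xs is a sequence of input vectors
        h = np.tanh(W_hx @ x + W_hh @ h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 4, 5
xs = [rng.standard_normal(input_dim) for _ in range(seq_len)]
W_hx = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
h0 = np.zeros(hidden_dim)

for t, h in enumerate(rnn_forward(xs, W_hx, W_hh, h0)):
    print(t, h.round(3))
```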

Table 2. Activation Functions

Activation               Formula
Step function            \phi(t) = 1 if t \ge 0, otherwise 0
Linear combination       y = \sum_i w_i x_i + b
Sigmoid function         \sigma(t) = 1 / (1 + e^{-t})
Hyperbolic tangent       \phi(t) = \tanh(t) = (e^t - e^{-t}) / (e^t + e^{-t})
ReLU                     \phi(t) = \max(0, t)
Softmax function         \phi(t)_i = e^{t_i} / \sum_{j \in L} e^{t_j}
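These functions are straightforward to implement; the snippet below is a small NumPy version of the nonlinear activations in Table 2, applied element-wise to a sample vector.

```python
import numpy as np

def step(t):      return np.where(t >= 0, 1.0, 0.0)
def sigmoid(t):   return 1.0 / (1.0 + np.exp(-t))
def tanh(t):      return np.tanh(t)
def relu(t):      return np.maximum(0.0, t)
def softmax(t):
    e = np.exp(t - t.max())           # subtract the max for numerical stability
    return e / e.sum()

t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (step, sigmoid, tanh, relu, softmax):
    print(fn.__name__, fn(t).round(3))
```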


[Figure 3 shows a single neuron: inputs x_1, ..., x_n are multiplied by weights w_{1j}, ..., w_{nj} and combined with a threshold \theta_j to form the net input net_j, which is passed through the activation (transfer) function \phi to produce the output activation o_j.]

Figure 3. Use of Activation Function to get Output Activations

Mikolov [30] added recurrence to the hidden layer to improve the neural net language model and reported perplexity reductions of over 50% against a Good-Turing trigram baseline and over 40% against a modified Kneser-Ney smoothed 5-gram. Another similar work [31] improves perplexity and BLEU scores in statistical machine translation. A generalization of recurrent neural networks presented as recursive autoencoders (RAE) by [32] has been useful in full-sentence paraphrase detection, beating the state-of-the-art techniques and almost doubling the F1 score on the MSRP paraphrase corpus; unsupervised RAE can be used to learn feature vectors for phrases from parse trees. The state-of-the-art embedding model is word2vec, which, rather than a full neural network, is trained with a simple single-layer architecture based on the inner product between two word vectors.

t 1 t t1

t 1 t t 1

ht 1 ht ht 1

ht-1 ht ht+1

ht 1 ht ht 1 ht 2

ht 2 ht 1 ht ht 1

xt 1 xt xt 1

Figure 4. Gradients through Backpropagation

Recently, Mikolov et al. [33] presented a gradient-descent-optimized RNN architecture to learn word sequences in natural language. With a slight structural modification of the RNN model, they obtained improved performance, equivalent to LSTM, on language modeling tasks. The extent to which RNNs can be exploited is limited by the effectiveness of the training algorithms. Gradient-based methods (Figure 4) suffer from the exploding and vanishing gradient


problem, as the back-propagated error quickly either blows up or vanishes. Hence standard RNNs are not suitable for capturing long-term dependencies.

3.2.6. CNN Based Models: RNN-based embeddings are helpful in making predictions within local context windows (shallow window-based methods). However, these embeddings are task and domain specific. A unified model of multitask learning [34] was proposed that uses a convolutional neural network to learn a vector for multiple tasks; the network was trained jointly on all the tasks using weight sharing. Researchers in the text mining community have experimented with various context-window sizes. To gain the advantage of the full context window, [35] used the full context of a word for learning the word representations. The SENNA system [35] uses convolutional neural networks to extract more generic representations that can be shared across the tasks of language modeling, named entity recognition, part-of-speech tagging, semantic role labeling, syntactic parsing and chunking. Convolutional Neural Networks (CNNs) have wide applications in computer vision. CNNs are a particular variant of the feedforward neural network. Using a CNN is advantageous over traditional neural networks in terms of parameter sharing: unlike traditional neural networks, a CNN can reuse filter parameters in convolution operations to calculate the output. Given an input sentence matrix W \in R^{n \times d} and a filter f \in R^{m \times d} with sliding window size m, the convolutional output of W and f is an n-dimensional vector o:

o_i = \sum_{k=0}^{m-1} \sum_{j=0}^{d-1} f_{k,j} \, W_{i+m-k-1,\, j}    (4)

The advantage of CNNs over RNNs is that CNNs can extract hierarchical features over large contexts by stacking layers to represent large context windows. Besides, a CNN can be parallelized, as its output does not depend on processing words sequentially.
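The sketch below is a minimal NumPy version of this operation: a filter slides over windows of m consecutive word vectors in a toy sentence matrix and produces one output value per position. It is written as a cross-correlation (no kernel flip), which is how convolution layers are usually implemented in practice; the data and names are illustrative only.

```python
import numpy as np

def conv1d_text(W, f):
    """Convolve a sentence matrix W (n x d) with a filter f (m x d).

    Each output o[i] is the sum of elementwise products between the filter
    and the window of m word vectors starting at position i (zero-padded at
    the end so the output has length n)."""
    n, d = W.shape
    m = f.shape[0]
    padded = np.vstack([W, np.zeros((m - 1, d))])   # pad so every position has a full window
    return np.array([np.sum(padded[i:i + m] * f) for i in range(n)])

rng = np.random.default_rng(0)
n, d, m = 7, 5, 3                  # 7 words, 5-dimensional embeddings, window size 3
W = rng.standard_normal((n, d))    # sentence matrix: one embedding per row
f = rng.standard_normal((m, d))    # one convolutional filter
print(conv1d_text(W, f).round(3))
```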

3.2.7. LSTM Based Models: Unrolled RNNs work well in local contexts and are suitable for capturing relationships in sequential data, i.e., word-order information in the case of text. However, they fail to capture long-term dependencies due to the vanishing and exploding gradient problem. Long Short-Term Memory (LSTM), introduced in [36], resolves the vanishing gradient problem by introducing special units called memory blocks in the recurrent hidden layer. Memory blocks include memory cells and special multiplicative units called gates. Memory cells are capable of storing the temporal state information of the network, while the additional multiplicative units control the flow of information in the network by regulating what to add to and what to remove from the cell state. Memory cells maintain a more constant error by preserving the error that is back-propagated through different time steps and layers, thus allowing recurrent nets to continue learning over many time steps and layers. Unlike the RNN, which is biased towards recent observations, LSTM tries to solve the vanishing gradient problem by keeping the error flowing back through time constant. Every cell in a simple LSTM consists of three nonlinear gates, namely input (i), forget (f) and output (o). The input gate controls the flow of input activations into the memory cell. The forget gate selectively discards long-term dependency information from going further through the network. The output gate selects the information to pass to the next cell/layer. The LSTM cell can be defined with the following set of equations:

ii it () xW t  h t1 U  b i (5) ff ft () x t W  h t1 U  b f (6)


o_t = \sigma(x_t W^o + h_{t-1} U^o + b_o)    (7)
g_t = \tanh(x_t W^g + h_{t-1} U^g + b_g)    (8)

ct c t1 f t g t i t (9)

httt c o t a n h ( ) (10) y yhUbtty() (11)

where i_t, f_t and o_t are the input, forget and output gate vectors respectively, c_t is the cell memory vector, \sigma is the sigmoid function, \odot denotes element-wise multiplication, and W, U and b are parameters (W the input weight matrices, U the recurrent/update weight matrices, and b the bias vectors).

LSTM has several variations. Among others, [37] proposed a model with "peephole connections", which allows the gate layers to look at the cell state. Eqns (5)-(7) can be modified to obtain Eqns (12)-(14):

i_t = \sigma(x_t W^i + c_{t-1} V^i + h_{t-1} U^i + b_i)    (12)
f_t = \sigma(x_t W^f + c_{t-1} V^f + h_{t-1} U^f + b_f)    (13)
o_t = \sigma(x_t W^o + c_{t-1} V^o + h_{t-1} U^o + b_o)    (14)

Other researchers rule out the concept of separately deciding the activations at the forget and input gates and present a model with coupled forget and input gates: the forget gate is updated only when the input gate needs to be updated, and vice versa. Thus Eqn (9) can be redefined as:

ccfgfttttt1 (1) (15) Gated Recurrent Unit (GRU) by [38] combines the forget and input gates into a single “update gate.” GRU adaptively capture dependencies on different time scales. It merges the cell state and hidden state to get Eqns 16-19. This results in a simpler model than basic LSTM model. zz zWxUhttt ()1 (16) rr rWxUhttt ()1 (17) jj jxWrhWtttttanh(()) 1 (18)

ht(1  z t ) h t1  z t j t (19) Depth Gated RNNs proposed in [39] is another variation the LSTM have. To introduce a linear dependence between lower and upper recurrent units, memory cells in adjacent layers are connected by a new gate called “depth-gate.” Authors claimed that this new model improve the performance on machine translation and language modeling.

LLLLLL1  1  1  1  1  1 it () W xi x t  W hi h t11  W ci c t (20) LLLLL11111 fWxtxfthftcft WhW() c 11 (21) LLL111 LLLLL  1111 dbWldxd tcdtldt xWcWc()1 (22) LLLLLL1  1  1  1  1 ct d ttt c  f c tt11  itanh( W xcthct x  W h ) (23) LLLLL1  1  1  1  1 ot ( W xo x t  W ho h t1  W co c t ) (24) LLL1  1  1 ht o ttanh( c t ) (25)


iL1 f L1 oL1 d L1 where t , t , t and l are the input gate, forget gate, output gate and the depth gate. cL1 c L In Figure 5 the depth-gate connects cell t in layer L+1 to cell l in layer L and controls the flow of information from lower to upper layer.

Figure 5. The Depth-Gated LSTM by Yao, et al. (2015)
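To tie the basic equations together, here is a minimal NumPy sketch of a single LSTM step following Eqns (5)-(10); the weights are randomly initialized and all dimensions are illustrative rather than taken from any of the cited systems.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates (Eqns 5-8), cell update (Eqn 9), hidden state (Eqn 10)."""
    W, U, b = params                                        # dicts keyed by gate name
    i = sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])    # input gate
    f = sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])    # forget gate
    o = sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])    # output gate
    g = np.tanh(x_t @ W["g"] + h_prev @ U["g"] + b["g"])    # candidate cell input
    c = c_prev * f + g * i                                  # Eqn (9)
    h = np.tanh(c) * o                                      # Eqn (10)
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 6, 4
gates = ("i", "f", "o", "g")
W = {k: rng.standard_normal((d_in, d_hid)) * 0.1 for k in gates}
U = {k: rng.standard_normal((d_hid, d_hid)) * 0.1 for k in gates}
b = {k: np.zeros(d_hid) for k in gates}

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.standard_normal((5, d_in)):     # scan a toy 5-step input sequence
    h, c = lstm_step(x, h, c, (W, U, b))
print(h.round(3))
```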

3.3. Tensor Space Model

Embeddings are task and domain specific; therefore, learning universal embeddings while keeping word-order and contextual information intact is still a challenging task. Tensors [10], alternatives to vectors, are also gaining attention in semantic representation, for information retrieval [28], document analysis and text categorization. Dimensionality reduction in VSM is limited by the number of samples, whereas the Tensor Space Model (TSM) has no such limit: HOSVD under TSM can reduce the data to any dimension [28].
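Below is a minimal NumPy sketch of a truncated higher-order SVD (HOSVD) on a toy 3-way tensor; the term x document x context layout, the random data and the chosen ranks are purely illustrative assumptions.

```python
import numpy as np

def hosvd(T, ranks):
    """Truncated HOSVD of a 3-way tensor T to the given multilinear ranks."""
    factors = []
    for mode, r in enumerate(ranks):
        # Mode-n unfolding: bring `mode` to the front and flatten the rest.
        unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])                 # keep the top-r left singular vectors
    # Core tensor: project T onto the truncated factor matrices.
    core = np.einsum('ijk,ia,jb,kc->abc', T, *factors)
    return core, factors

# Toy term x document x context-window tensor (random stand-in for real counts).
T = np.random.default_rng(0).random((50, 20, 6))
core, (U_terms, U_docs, U_ctx) = hosvd(T, ranks=(5, 4, 3))

# Reconstruct a low-dimensional approximation and report the relative error.
approx = np.einsum('abc,ia,jb,kc->ijk', core, U_terms, U_docs, U_ctx)
print(core.shape, round(np.linalg.norm(T - approx) / np.linalg.norm(T), 3))
```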

4. Tools for Semantic Representation Learning

The success and popularity of embeddings is also due to open-source software packages such as word2vec (including pre-trained word embeddings), GloVe and Deeplearning4j. In this section, we present some of the tools/software packages (Table 3) that are publicly available as open source or as proprietary software for semantic representation learning. This may help a beginner choose a proper tool for her/his research.


Table 3. Tools/Software Packages for Semantic Representations

Deeplearning4j — Open source: Yes. Developed by: the DL4J community led by Adam Gibson; Skymind. Written in: Java. Support for: Java, Scala, Clojure, Python (Keras). License: Apache 2.0. Applications: anomaly detection and cyber security.

TensorFlow — Open source: Yes. Developed by: the Google Brain team. Written in: C++, Python. Support for: Python (Keras), C/C++, Java, Go, R. License: Apache 2.0. Applications: machine learning; detecting and deciphering patterns and correlations; used in Google products.

Torch — Open source: Yes. Developed by: Ronan Collobert, Koray Kavukcuoglu, Clement Farabet. Written in: C, Lua. Support for: Lua, LuaJIT, C, utility library for C++/OpenCL. License: BSD 3-clause ("New" or "Revised"). Applications: deep machine learning; used by the Facebook AI Research group, IBM, Yandex and the Idiap Research Institute.

Theano — Open source: Yes. Developed by: the machine learning group at the Université de Montréal. Written in: Python. Support for: Python. License: BSD. Applications: defining, optimizing, and evaluating mathematical expressions, especially with multi-dimensional arrays.

Caffe — Open source: Yes. Developed by: the Berkeley Vision and Learning Center; originally Yangqing Jia. Written in: C++. Support for: Python, MATLAB. License: BSD. Applications: deep learning.

ND4J — Open source: Yes. Developed by: the DL4J community, led by Adam Gibson. Written in: Java. Support for: Java, Scala, Clojure, Python (Keras). License: Apache 2.0. Applications: linear algebra and matrix manipulation in a production environment; used by NASA JPL for tasks such as climatic modeling.

Gensim — Open source: Yes. Developed by: RaRe Technologies; originally Radim Řehůřek. Written in: Python. Support for: NumPy, SciPy, Cython. License: LGPL. Applications: information retrieval, vector space modeling and topic modeling.

RapidMiner — Open source: No. Developed by: the AI Unit of the Technical University of Dortmund. Written in: Java. Support for: GUI, R and Python scripts. License: Proprietary (limited free edition under a GPL license). Applications: data preparation, machine learning, deep learning, text mining, and predictive analytics.

Neural Designer — Open source: No. Developed by: Artelnics. Written in: C++. Support for: graphical user interface. License: Proprietary. Applications: data mining, machine learning, predictive analytics.

Wolfram Mathematica — Open source: No. Developed by: Wolfram Research. Written in: C/C++, Java, Wolfram Language. Support for: command line, Java, C++. License: Proprietary. Applications: text mining (SA), computer algebra, numerical computations, information visualization, statistics.


5. Conclusions

This paper is, to our knowledge, the first systematic survey of methods for semantic representation. In the survey, we found that no single approach is perfectly suitable for all text mining and NLP applications: some are suitable for one set of applications, others for another. Therefore, the optimal approach to semantic representation depends on the intended application. It also depends on whether the representations are to be applied in a supervised or unsupervised environment: in an unsupervised environment, fast, shallow, log-linear BoW models can still achieve the best performance, while models based on deep learning should be preferred for representations in a supervised environment. It has been observed informally that embeddings of longer units (e.g., sentences and phrases rather than words and characters) are more appropriate for capturing internal sentence structure, word order and context, which suits applications that require deep language processing, such as machine translation and speech recognition. Word2vec vectors work well in the majority of supervised settings; however, for paraphrase identification, SDAEs are a better approach. Taken together, these findings provide valuable pointers for researchers looking to work in the field of semantic representations. The survey also shows that one needs to develop a model for learning universal embeddings in unsupervised/semi-supervised settings that incorporates contextual as well as word-order information, uses language-independent features, and remains feasible for large datasets.

References

[1] J. F. Sowa, "Knowledge Representation: Logical, Philosophical, and Computational Foundations", Brooks/Cole, Pacific Grove, CA, (2000).
[2] G. Salton, A. Wong and C. S. Yang, "A Vector Space Model for Automatic Indexing", Commun. ACM, vol. 18, no. 11, (1975), pp. 613-620.
[3] B. Croft and J. Lafferty, Eds., "Language Modeling for Information Retrieval", (2013).
[4] M. Uschold, M. King, S. Moralee and Y. Zorgios, "The Enterprise Ontology", Knowl. Eng. Rev., vol. 13, no. 1, (1998), pp. 31-89.
[5] J. Wu, Z. Xuan and D. Pan, "Enhancing text representation for classification tasks with semantic graph structures", Int. J. Innov. Comput. Inf. Control, vol. 7, no. 5 B, (2011), pp. 2689-2698.
[6] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, "Indexing by Latent Semantic Analysis", J. Am. Soc. Inf. Sci., vol. 41, no. 6, (1990), pp. 391-407.
[7] T. Mikolov, G. Corrado, K. Chen and J. Dean, "Efficient Estimation of Word Representations in Vector Space", Proc. Int. Conf. Learn. Represent. (ICLR 2013), (2013), pp. 1-12.
[8] T. Mikolov, K. Chen, G. Corrado and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality", NIPS, (2013), pp. 1-9.
[9] J. Xu, Y. Tao and H. Lin, "Semantic word cloud generation based on word embeddings", IEEE Pacific Vis. Symp., vol. 2016, (2016) May, pp. 239-243.
[10] A. Cichocki, R. Zdunek, A. H. Phan and S. I. Amari, "Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation", (2009).
[11] D. Jurafsky and J. H. Martin, "Speech and Language Processing", (2015).
[12] R. A. García-Hernández and Y. Ledeneva, "Word sequence models for single text summarization", Proc. 2nd Int. Conf. Adv. Comput. Interact. ACHI 2009, (2009), pp. 44-48.
[13] K. Radinsky, E. Agichtein, E. Gabrilovich and S. Markovitch, "A word at a time: computing word relatedness using temporal semantic analysis", Proc. 20th Int. World Wide Web Conf. WWW'11, (2011), pp. 337-346.
[14] E. Gabrilovich and S. Markovitch, "Computing semantic relatedness using wikipedia-based explicit semantic analysis", IJCAI Int. Jt. Conf. Artif. Intell., (2007), pp. 1606-1611.
[15] T. Hofmann, "Unsupervised learning by probabilistic Latent Semantic Analysis", Mach. Learn., vol. 42, no. 1-2, (2001), pp. 177-196.
[16] W. Buntine, "Estimating Likelihoods for Topic Models", Asian Conference on Machine Learning, (2009), pp. 51-64.
[17] T. Singh, M. Kumari, T. L. Pal and A. Chauhan, "Current Trends in Text Mining for Social Media", Int. J. Grid Distrib. Comput., vol. 10, no. 6, (2017), pp. 11-28.
[18] T. Mikolov, I. Sutskever, A. Deoras, H. S. Le, S. Kombrink and J. Cernocky, "Subword language modeling with neural networks", (2012) March.
[19] W. Ling, "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", Proc. 2015 Conf. Empir. Methods Nat. Lang. Process., (2015) September, pp. 1520-1530.
[20] Y. Kim, Y. Jernite, D. Sontag and A. M. Rush, "Character-Aware Neural Language Models", AAAI, (2016), pp. 2741-2749.
[21] J. Pennington, R. Socher and C. D. Manning, "GloVe: Global Vectors for Word Representation", Proc. 2014 Conf. Empir. Methods Nat. Lang. Process., (2014), pp. 1532-1543.
[22] Y. Liu, Z. Liu, T.-S. Chua and M. Sun, "Topical Word Embeddings", Proc. 29th AAAI Conf. Artif. Intell., vol. 2, no. C, (2015), pp. 2418-2424.
[23] E. H. Huang, R. Socher, C. D. Manning and A. Y. Ng, "Improving Word Representations via Global Context and Multiple Word Prototypes", Proc. 50th Annu. Meet. Assoc. Comput. Linguist., (2012) July, pp. 873-882.
[24] R. Socher and C. Lin, "Parsing natural scenes and natural language with recursive neural networks", Proc. …, (2011), pp. 129-136.
[25] H. Palangi, "Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval", IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 4, (2016), pp. 694-707.
[26] Q. Le and T. Mikolov, "Distributed Representations of Sentences and Documents", Int. Conf. Mach. Learn. - ICML 2014, vol. 32, (2014), pp. 1188-1196.
[27] H. Amiri, P. Resnik, J. Boyd-Graber and H. Daum, "Learning Text Pair Similarity with Context-sensitive Autoencoders", (2015).
[28] L. Ning, "Text representation: From vector to tensor", Proc. IEEE Int. Conf. Data Mining, ICDM, (2005), pp. 725-728.
[29] Y. Bengio, R. Ducharme, P. Vincent and C. Janvin, "A Neural Probabilistic Language Model", J. Mach. Learn. Res., vol. 3, (2003), pp. 1137-1155.
[30] T. Mikolov, A. Deoras, S. Kombrink, L. Burget and J. Černocký, "Empirical Evaluation and Combination of Advanced Language Modeling Techniques", Interspeech, (2011) August, pp. 605-608.
[31] H. Schwenk, A. Rousseau and M. Attik, "Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation", NAACL-HLT Workshop: Futur. Lang. Model. HLT, (2012), pp. 11-19.
[32] R. Socher, E. Huang and J. Pennington, "Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection", Adv. Neural Inf. Process. Syst., (2011), pp. 801-809.
[33] T. Mikolov, A. Joulin, S. Chopra and M. Mathieu, "Learning Longer Memory in Recurrent Neural Networks", ICLR, (2015), pp. 1-9.
[34] R. Collobert and J. Weston, "A unified architecture for natural language processing", Proc. 25th Int. Conf. Mach. Learn. - ICML '08, vol. 20, no. 1, (2008), pp. 160-167.
[35] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa, "Natural Language Processing (almost) from Scratch", (2011).
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Comput., vol. 9, no. 8, (1997), pp. 1-32.
[37] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count", Proc. IEEE-INNS-ENNS Int. Jt. Conf. Neural Networks (IJCNN 2000), vol. 1, (2000), pp. 189-194.
[38] J. Chung, C. Gulcehre, K. Cho and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling", arXiv, (2014), pp. 1-9.
[39] K. Yao, T. Cohn, K. Vylomova, K. Duh and C. Dyer, "Depth-Gated Recurrent Neural Networks", Jelinek Summer Workshop on Speech and Language Technology, (2015), pp. 1-5.
