Semantic Role Labeling with Neural Network Factors

Nicholas FitzGerald‡∗  Oscar Täckström†  Kuzman Ganchev†  Dipanjan Das†
‡Department of Computer Science and Engineering, University of Washington
†Google, New York
[email protected]  {oscart,kuzman,dipanjand}@google.com

∗Work carried out during an internship at Google.

Abstract

We present a new method for semantic role labeling in which arguments and semantic roles are jointly embedded in a shared vector space for a given predicate. These embeddings belong to a neural network, whose output represents the potential functions of a graphical model designed for the SRL task. We consider both local and structured learning methods and obtain strong results on standard PropBank and FrameNet corpora with a straightforward product-of-experts model. We further show how the model can learn jointly from PropBank and FrameNet annotations to obtain additional improvements on the smaller FrameNet dataset.

1 Introduction

Semantic role labeling (SRL) is the task of identifying the semantic arguments of a predicate and labeling them with their semantic roles. A key challenge in this task is the sparsity of labeled data: a given predicate-role instance may occur only a handful of times in the training set. Most existing SRL systems model each semantic role as an atomic unit of meaning, ignoring finer-grained semantic similarity between roles that can be leveraged to share context between similar labels, both within and across annotation conventions. Low-dimensional embedding representations have been shown to be successful in overcoming sparsity and representing label similarity across a wide range of tasks (Weston et al., 2011; Srikumar and Manning, 2014; Hermann et al., 2014; Lei et al., 2015).

In this paper, we present a new model for SRL that embeds candidate arguments and semantic roles (in the context of a predicate frame) in a shared vector space. A feed-forward neural network is learned to capture correlations of the respective embedding dimensions to create argument and role representations. The similarity of these two representations, as measured by their dot product, is used to score possible roles for candidate arguments within a graphical model. This graphical model jointly models the assignment of semantic roles to all arguments of a predicate, subject to structural linguistic constraints.
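To make the dot-product scoring concrete, here is a minimal NumPy sketch of the idea. It is not the authors' implementation: the dimensions, the single hidden layer, the tanh nonlinearity, the toy role inventory, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's settings).
D_ARG, D_ROLE, D_SHARED = 50, 16, 32

# Projections into the shared space; in the real model these parameters
# are learned from annotated data, here they are random for demonstration.
W_arg = rng.normal(scale=0.1, size=(D_SHARED, D_ARG))
W_role = rng.normal(scale=0.1, size=(D_SHARED, D_ROLE))

role_vocab = {"A0": 0, "A1": 1, "AM-TMP": 2}  # toy role inventory
role_embeddings = rng.normal(scale=0.1, size=(len(role_vocab), D_ROLE))

def embed_argument(features: np.ndarray) -> np.ndarray:
    """Map raw features of a candidate argument span to the shared space."""
    return np.tanh(W_arg @ features)

def embed_role(role: str) -> np.ndarray:
    """Map a semantic role label to the same shared space."""
    return np.tanh(W_role @ role_embeddings[role_vocab[role]])

def score(features: np.ndarray, role: str) -> float:
    """Dot-product similarity: the local score for a (span, role) pair."""
    return float(embed_argument(features) @ embed_role(role))

# Score every role for one candidate span (random stand-in features).
span_features = rng.normal(size=D_ARG)
print({role: round(score(span_features, role), 3) for role in role_vocab})
```

These local scores are not used greedily: they feed into the graphical model mentioned above, which assigns all of a predicate's roles jointly under structural constraints.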
Our model has several advantages. Compared to the linear multiclass classifiers used in prior work, vector embeddings of the predictions overcome the assumption that each semantic role is a discrete label, thus capturing fine-grained label similarity. Moreover, since predictions and inputs are embedded in the same vector space, and features extracted from inputs and outputs are decoupled, our approach is amenable to joint learning of multiple annotation conventions, such as PropBank and FrameNet, in a single model. Finally, as with other neural network approaches, our model obviates the need to manually engineer feature conjunctions.

Our underlying inference algorithm for SRL follows Täckström et al. (2015), who presented a dynamic program for structured SRL; it is targeted towards the prediction of full argument spans. Hence, we present empirical results on three span-based SRL datasets: CoNLL 2005 and 2012 data annotated with PropBank conventions, as well as FrameNet 1.5 data. We also evaluate our system on the dependency-based CoNLL 2009 shared task by assuming single-word argument spans that represent semantic dependencies, and we limit our experiments to English. On all datasets, our model performs on par with a strong linear model baseline that uses hand-engineered conjunctive features.

Due to random parameter initialization and stochasticity in the online learning algorithm used to train our models, we observed considerable variance in performance across datasets. To reduce this variance, we adopt a product-of-experts model that combines multiple randomly initialized instances of our model, achieving state-of-the-art results on the CoNLL 2009 and FrameNet datasets while coming close to the previous best published results on the other two. Finally, we present even stronger results for FrameNet data (which is scarce) by jointly training the model with PropBank-annotated data.

[Figure 1: FrameNet (a) and PropBank (b) annotations for two sentences. In "John stole my car." and "Mary lifted a purse.", FrameNet assigns the roles Perpetrator and Goods under the single frame Theft (evoked by steal.V and lift.V), while PropBank assigns A0 and A1 under the distinct frames steal.01 and lift.02.]

2 Background

In this section, we briefly describe the SRL task and discuss relevant prior work.

2.1 Semantic Role Labeling

SRL annotations rely on a frame lexicon containing frames that could be evoked by one or more lexical units. A lexical unit consists of a word lemma conjoined with its coarse-grained part-of-speech tag.¹ Each frame is further associated with a set of possible core and non-core semantic roles, which are used to label its arguments. This description of a frame lexicon covers both PropBank and FrameNet conventions, but there are some differences, outlined below. See Figure 1 for example annotations.

A typical SRL dataset consists of sentence-level annotations that identify (possibly multiple) target predicates in each sentence, a disambiguated frame for each predicate, and the associated argument spans (or single-word argument heads) labeled with their respective semantic roles.

PropBank defines frames that are essentially sense distinctions of a given lexical unit. The set of PropBank roles consists of seven generic core roles (labeled A0–A5 and AA) that assume different semantics for different frames, with each frame associating with a subset of the core roles. In addition, there are 21 non-core roles that encapsulate further arguments of a frame, such as temporal (AM-TMP) and locative (AM-LOC) adjuncts. The non-core roles are shared between all frames and assume similar meaning.

In contrast, a FrameNet frame often associates with multiple lexical units, and the frame lexicon contains several hundred core and non-core roles that are shared across frames. For example, the FrameNet frame Theft could be evoked by the verbs steal, pickpocket, or lift, while PropBank has distinct frames for each of them. The Theft frame also contains the core roles Goods and Perpetrator, which additionally belong to the Commercial_transaction and Committing_crime frames, respectively.

¹We borrow the term "lexical unit" from the frame semantics literature. The CoNLL 2005 dataset is restricted to verbal lexical units, while the CoNLL 2009 and 2012 datasets contain both verbal and nominal lexical units. FrameNet has lexical units of several coarse syntactic categories.
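The structural contrast between the two lexicons can be made concrete with a toy fragment. This is only an illustration under our own simplifications: the role inventories below are incomplete and the dictionary layout is an assumption, not an actual PropBank or FrameNet file format.

```python
# FrameNet: one frame covers several lexical units, with frame-specific
# role names shared across those predicates (roles simplified here).
framenet_lexicon = {
    "Theft": {
        "lexical_units": ["steal.V", "pickpocket.V", "lift.V"],
        "core_roles": ["Perpetrator", "Goods"],
    },
}

# PropBank: one frame per predicate sense, with generic numbered roles.
propbank_lexicon = {
    "steal.01": {"lexical_unit": "steal.V", "core_roles": ["A0", "A1"]},
    "lift.02": {"lexical_unit": "lift.V", "core_roles": ["A0", "A1"]},
}

# The same argument is Perpetrator in FrameNet but A0 in PropBank.
print(framenet_lexicon["Theft"]["core_roles"][0])    # Perpetrator
print(propbank_lexicon["steal.01"]["core_roles"][0])  # A0
```

Note how steal.V appears in both lexicons with differently named roles; Figure 1 shows the corresponding annotations.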
2.2 Related Work

SRL using PropBank conventions (Palmer et al., 2005) has been widely studied. There have been two shared tasks, at CoNLL 2004 and 2005, to identify the phrasal arguments of verbal predicates (Carreras and Màrquez, 2004; Carreras and Màrquez, 2005). The CoNLL 2008 and 2009 shared tasks introduced a variant in which semantic dependencies are annotated rather than phrasal arguments (Surdeanu et al., 2008; Hajič et al., 2009). Similar approaches (Das et al., 2014; Hermann et al., 2014) have been applied to frame-semantic parsing using FrameNet conventions (Baker et al., 1998). We treat PropBank and FrameNet annotations in a common framework, similar to Hermann et al. (2014).

Most prior work on SRL relies on syntactic parses provided as input and uses locally estimated classifiers for each span-role pair that are only combined at prediction time.² This is done by picking the highest-scoring role for each span, subject to a set of structural constraints, such as avoiding overlapping arguments and repeated core roles. Typically, these constraints have been enforced by integer linear programming (ILP), as in Punyakanok et al. (2008). Täckström et al. (2015) interpreted this as a graphical model with local factors for each span-role pair and global factors that encode the structural constraints. They derived a dynamic program (DP) that enforces most of the constraints proposed by Punyakanok et al. and showed how the DP can be used to take these constraints into account during learning. Here, we use an identical graphical model, but extend the model of Täckström et al. by replacing its linear potential functions with a multi-layer neural network. A similar use of non-linear potential functions in a structured model was proposed by Do and Artières (2010) for speech recognition, and by Durrett and Klein (2015) for syntactic phrase-structure parsing.

Feature-based approaches to SRL employ hand-engineered, linguistically motivated feature templates to represent the semantic structure. Some recent work has focused on low-dimensional representations that reduce the need for intensive feature engineering and lead to better generalization in the face of data sparsity. Lei et al. (2015) employ low-rank tensor factorization to induce a compact representation of the full cross-product of atomic features; akin to this work, they represent semantic roles as real-valued vectors, but use a different scoring formulation for modeling potential arguments. Low-dimensional representations have also been used in structured graphical models (Srikumar and Manning, 2014), and in techniques to learn joint embeddings of predicate words and their semantic frames in a vector space (Hermann et al., 2014).

²Some recent work has successfully proposed joint models for syntactic parsing and SRL instead of a pipeline approach (Lewis et al., 2015).

3 Model

Our model for SRL performs inference separately for each marked predicate in a sentence. It assumes that the predicate has been automatically disambiguated to a semantic frame drawn from a frame lexicon, and the semantic roles of the frame are used for labeling the candidate arguments in the sentence. Formally, we are given a sentence x in which a predicate t, with lexical unit ℓ, has been marked. Assuming that the semantic frame f of the predicate has already been identified (see §4.2 for this step), we seek to predict the set of spans that ...
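The pieces described so far can be summarized in one scoring equation. The formalization below is our own hedged reconstruction from the prose: the symbols φ, a_f, e_f, y, and C are our notation, not necessarily the paper's.

```latex
% Local potential for labeling candidate span s with role r under frame f,
% using the argument embedding a_f(s) and role embedding e_f(r) produced
% by the feed-forward network:
\[
  \phi(s, r) \;=\; \exp\!\big( a_f(s)^{\top}\, e_f(r) \big)
\]
% A joint assignment y of roles to spans is scored subject to the
% structural constraints C (no overlapping arguments, no repeated core
% roles), enforced via the dynamic program of Täckström et al. (2015):
\[
  p(\mathbf{y} \mid x, t, f) \;\propto\;
  \mathbf{1}\big[\mathbf{y} \in \mathcal{C}\big]
  \prod_{(s,\, r) \in \mathbf{y}} \phi(s, r)
\]
```

Under this reading, the product-of-experts ensemble from the introduction plausibly corresponds to multiplying the potentials (equivalently, summing the log-potentials) of several independently initialized instances before running the same dynamic program.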