arXiv:2008.01739v2 [cs.CL] 4 Jun 2021

Select, Extract and Generate: Neural Keyphrase Generation with Layer-wise Coverage Attention

Wasi Uddin Ahmad†∗, Xiao Bai‡, Soomin Lee‡, Kai-Wei Chang†
†University of California, Los Angeles, ‡Yahoo Research
†{wasiahmad,[email protected] ‡{xbai,[email protected]

Abstract

Natural language processing techniques have demonstrated promising results in keyphrase generation. However, one of the major challenges in neural keyphrase generation is processing long documents using deep neural networks. Generally, documents are truncated before being given as inputs to neural networks. Consequently, the models may miss essential points conveyed in the target document. To overcome this limitation, we propose SEG-Net, a neural keyphrase generation model that is composed of two major components, (1) a selector that selects the salient sentences in a document and (2) an extractor-generator that jointly extracts and generates keyphrases from the selected sentences. SEG-Net uses Transformer, a self-attentive architecture, as the basic building block with a novel layer-wise coverage attention to summarize most of the points discussed in the document. The experimental results on seven keyphrase generation benchmarks from scientific and web documents demonstrate that SEG-Net outperforms the state-of-the-art neural generative methods by a large margin.

Title: [1] natural language processing technologies for developing a language learning environment .
Abstract: [1] so far , computer assisted language learning ( call ) comes in many different flavors . [1] our research work focuses on developing an integrated e learning environment that allows improving language skills in specific contexts . [1] integrated e learning environment means that it is a web based solution . , for instance , web browsers or email clients . [0] it should be accessible . [1] natural language processing ( nlp ) forms the technological basis for developing such a learning framework . [0] the paper gives an overview . [0] therefore , on the one hand , it explains creation . [0] on the other hand , it describes existing nlp standards . [0] based on our requirements , the paper gives . [1] . necessary developments in e learning to keep in mind .
Present: natural language processing; computer assisted language learning; integrated e learning
Absent: semantic web technologies; learning of foreign languages
Figure 1: Example of a document with present and absent keyphrases. The value (0/1) in brackets ([]) represents the sentence salience label.

1 Introduction

Keyphrases are short pieces of text that summarize the key points discussed in a document. They are useful for many natural language processing and information retrieval tasks (Wilson et al., 2005; Berend, 2011; Tang et al., 2017; Subramanian et al., 2018; Zhang et al., 2017b; Wan and Xiao, 2008; Jones and Staveley, 1999; Kim et al., 2013; Hulth and Megyesi, 2006; Hammouda et al., 2005; Wu and Bolivar, 2008; Dave and Varma, 2010). In the automatic keyphrase generation task, the input is a document, and the output is a set of keyphrases that can be categorized as present or absent keyphrases. Present keyphrases appear exactly in the target document, while absent keyphrases are only semantically related and have partial or no overlap with the target document. We provide an example of a target document and its keyphrases in Figure 1.

In recent years, the neural sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014) has become the fundamental building block in keyphrase generation models. Most of the existing approaches (Meng et al., 2017; Chen et al., 2018; Yuan et al., 2020; Chen et al., 2019b) adopt the Seq2Seq framework with attention (Luong et al., 2015; Bahdanau et al., 2014) and copy mechanism (See et al., 2017; Gu et al., 2016).
However, present phrases indicate the indispensable segments of a target document. Emphasizing those segments improves document understanding, which can lead a model to coherent absent phrase generation. This motivates jointly modeling keyphrase extraction and generation (Chen et al., 2019a).

To generate a comprehensive set of keyphrases, reading the complete target document is necessary. However, to the best of our knowledge, none of the previous neural methods read the full content of a document, as it can be thousands of words long. Existing models truncate the target document: they take the first few hundred words as input and ignore the rest of the document, which may contain salient information. On the other hand, a significant fraction of a long document may not be associated with the keyphrases. Presumably, selecting the salient segments from the target document and then predicting the keyphrases from them would be effective.

To address the aforementioned challenges, in this paper, we propose SEG-Net (which stands for Select, Extract, and Generate) that has two major components, (1) a sentence-selector that selects the salient sentences in a document, and (2) an extractor-generator that predicts the present keyphrases and generates the absent keyphrases jointly. The motivation for designing the sentence-selector is to decompose a long target document into a list of sentences and identify the salient ones for keyphrase generation. We consider a sentence as salient if it contains present keyphrases or overlaps with absent keyphrases. As shown in Figure 1, we split the document into a list of sentences and classify them with salient and non-salient labels. A similar notion is adopted in prior works on text summarization (Chen and Bansal, 2018; Lebanoff et al., 2019) and question answering (Min et al., 2018). We employ Transformer (Vaswani et al., 2017) as the backbone of the extractor-generator in SEG-Net.

We equip the extractor-generator with a novel layer-wise coverage attention such that the generated keyphrases summarize the entire target document. The layer-wise coverage attention keeps track of the target document segments that are covered by previously generated phrases to guide the self-attention mechanism in Transformer while attending to the encoded target document in future generation steps. We evaluate SEG-Net on five benchmarks from scientific articles and two benchmarks from web documents to demonstrate its effectiveness over the state-of-the-art neural generative methods. We perform ablation and analysis to show that selecting salient sentences improves present keyphrase extraction and that the layer-wise coverage attention facilitates absent keyphrase generation. Our novel contributions are as follows.

1. SEG-Net, which first identifies the salient sentences in the target document and then uses them to generate a set of keyphrases.
2. A layer-wise coverage attention.

∗ Work done during internship at Yahoo Research.

2 Problem Definition

The keyphrase generation task is defined as follows: given a text document $x$, generate a set of keyphrases $K = \{k^1, k^2, \ldots, k^{|K|}\}$, where the document $x = [x_1, \ldots, x_{|x|}]$ and each keyphrase $k^i = [k^i_1, \ldots, k^i_{|k^i|}]$ is a sequence of words. A text document can be split into a list of sentences, $S_x = [s_x^1, s_x^2, \ldots, s_x^{|S|}]$, where each sentence $s_x^i = [x_j, \ldots, x_{j+|s^i|-1}]$ is a consecutive subsequence of the document $x$ with begin index $j \le |x|$ and end index $(j + |s^i|) < |x|$.

In the literature, keyphrases are categorized into two types, present and absent. A present keyphrase is a consecutive subsequence of the document, while an absent keyphrase is not. However, an absent keyphrase may partially overlap with the document's word sequence. We denote the sets of present and absent keyphrases as $K_p = \{k_p^1, k_p^2, \ldots, k_p^{|K_p|}\}$ and $K_a = \{k_a^1, k_a^2, \ldots, k_a^{|K_a|}\}$, respectively. Hence, we can express a set of keyphrases as $K = K_p \cup K_a$.

SEG-Net decomposes the keyphrase generation task into three sub-tasks. We define them below.

Task 1 (Salient Sentence Selection). Given a list of sentences $S_x$, predict a binary label (0/1) for each sentence $s_x^i$. The label 1 indicates that the sentence contains a present keyphrase or overlaps with an absent keyphrase. The output of the selector is a list of salient sentences $S_x^{sal}$.

Task 2 (Present Keyphrase Extraction). Given $S_x^{sal}$ as a concatenated sequence of words, predict a label (B/I/O) for each word that indicates if it is a constituent of a present keyphrase.

Task 3 (Absent Keyphrase Generation). Given $S_x^{sal}$ as a concatenated sequence of words, generate a concatenated sequence of keyphrases in a sequence-to-sequence fashion.

3 SEG-Net for Keyphrase Generation

Our proposed model, SEG-Net, jointly learns to extract and generate present and absent keyphrases from the salient sentences in a target document. The key advantage of SEG-Net is the maximal utilization of the information from the input text in order to generate a set of keyphrases that summarize all the key points in the target document. SEG-Net consists of a sentence-selector and an extractor-generator. The sentence-selector identifies the salient sentences from the target document (Task 1), which are fed to the extractor-generator to predict both the present and absent keyphrases (Tasks 2 and 3). We detail them in this section.

3.1 Embedding Layer

The embedding layer maps each word in an input sequence to a low-dimensional vector space. We train three embedding matrices, $W_e$, $W_{pos}$, and $W_{seg}$, that convert a word, its absolute position, and segment index into vector representations of size $d_{model}$. The segment index of a word indicates the index of the sentence that it belongs to.
In addition, we obtain a character-level embedding for each word using Convolutional Neural Networks (CNN) (Kim, 2014a). To learn a fixed-length vector representation of a word, we add the four embedding vectors element-wise. To form the vector representations of the keyphrase tokens, we only use their word and character-level embeddings.
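
The following is a minimal sketch of the embedding layer described above, not the authors' implementation. It assumes randomly initialised tables in place of trained ones, and it approximates the character-level CNN with a single 1-D convolution followed by max-over-time pooling. The sizes (`D_MODEL`, `CHAR_DIM`, `KERNEL`), the toy vocabulary, and all function names are hypothetical choices made for illustration.

```python
import random

D_MODEL = 8    # hypothetical size of d_model
CHAR_DIM = 4   # hypothetical character embedding size
KERNEL = 3     # hypothetical convolution width (in characters)

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Randomly initialised table; a trained model would learn these values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy vocabularies (assumptions for this sketch only).
vocab = {"<unk>": 0, "natural": 1, "language": 2, "processing": 3}
chars = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}  # 0 = padding

W_e = rand_matrix(len(vocab), D_MODEL)          # word embeddings
W_pos = rand_matrix(64, D_MODEL)                # absolute position embeddings
W_seg = rand_matrix(16, D_MODEL)                # segment (sentence index) embeddings
W_char = rand_matrix(len(chars) + 1, CHAR_DIM)  # character embeddings
# One convolution filter per output dimension, each spanning KERNEL characters.
filters = rand_matrix(D_MODEL, KERNEL * CHAR_DIM)

def char_cnn(word):
    """Character-level embedding: 1-D convolution + max-over-time pooling."""
    ids = [chars.get(c, 0) for c in word]
    ids += [0] * max(0, KERNEL - len(ids))  # pad words shorter than the kernel
    windows = []
    for start in range(len(ids) - KERNEL + 1):
        window = []
        for i in ids[start:start + KERNEL]:
            window.extend(W_char[i])
        windows.append(window)
    # For each filter, keep its strongest response over all character windows.
    return [max(sum(w * x for w, x in zip(f, win)) for win in windows)
            for f in filters]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def embed_source(word, position, segment):
    """Source token: word + position + segment + character-level vectors."""
    return add(W_e[vocab.get(word, 0)], W_pos[position],
               W_seg[segment], char_cnn(word))

def embed_target(word):
    """Keyphrase token: word and character-level vectors only."""
    return add(W_e[vocab.get(word, 0)], char_cnn(word))
```

For example, `embed_source("language", 1, 0)` returns a vector of length `D_MODEL` for the second word of the first sentence, while `embed_target("language")` returns the same word's representation without the position and segment terms.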