Factoid and Open-Ended Question Answering with BERT in the Museum Domain
Total pages: 16
File type: PDF · Size: 1020 KB
Recommended publications
Malware Classification with BERT
San José State University, SJSU ScholarWorks, Master's Projects, Master's Theses and Graduate Research, Spring 5-25-2021. By Joel Lawrence Alvares. Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects. Part of the Artificial Intelligence and Robotics Commons, and the Information Security Commons. Full title: Malware Classification with Word Embeddings Generated by BERT and Word2Vec. Committee: Prof. Fabio Di Troia, Prof. William Andreopoulos, and Prof. Katerina Potika, Department of Computer Science, May 2021.
ABSTRACT Malware classification is used to distinguish unique types of malware from each other. This project carries out malware classification using word embeddings, which are used in Natural Language Processing (NLP) to identify and evaluate the relationships between the words of a sentence. Word embeddings generated by BERT and Word2Vec are computed for malware samples to carry out multi-class classification. BERT is a transformer-based pre-trained NLP model that can be used for a wide range of tasks such as question answering, paraphrase generation, and next-sentence prediction. However, the attention mechanism of a pre-trained BERT model can also be used in malware classification by capturing information about the relation between each opcode and every other opcode belonging to a malware family. A minimal sketch of this embedding-plus-classifier approach appears below.
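The abstract itself contains no code, so the following is only an illustrative sketch of the general approach it describes: treat an opcode sequence as text, embed it with a pre-trained BERT model, and feed the [CLS] vector to a conventional classifier. The opcode strings and family labels are invented placeholders, and Hugging Face transformers plus scikit-learn stand in for whatever tooling the project actually used.

```python
import torch
from sklearn.svm import SVC
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(opcode_seq: str) -> list[float]:
    """Return the [CLS] embedding of an opcode sequence treated as text."""
    inputs = tokenizer(opcode_seq, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden[0, 0].tolist()                   # [CLS] token vector

# Hypothetical opcode sequences and family labels, for illustration only.
samples = ["mov push call jmp", "xor add loop ret", "push pop call call"]
labels = ["winwebsec", "zbot", "winwebsec"]

clf = SVC().fit([embed(s) for s in samples], labels)
print(clf.predict([embed("mov call jmp jmp")]))
```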
Words That Work: It's Not What You Say, It's What People Hear
[Garbled scan of the book's cover and praise pages; only fragments are recoverable.] "Frank Luntz understands the power of words to move public opinion and communicate big ideas. Any Democrat who writes off his analysis and decades of experience just because he works for the other side is making a big mistake. His lessons don't have a party label. The only question is, where's our Frank Luntz?"
Why are some people so much better than others at talking their way into a job or out of trouble? What makes some advertising jingles cut through the clutter of our crowded memories? What's behind winning campaign slogans and career-ending political blunders? Why do some speeches resonate and endure while others are forgotten moments after they are given? The answers lie in the way words are used to influence and motivate, the way they connect thought and emotion. And no person knows more about the intersection of words and deeds than language architect and public-opinion guru Dr. Frank Luntz. In Words That Work, Dr. Luntz not only raises the curtain on the craft of effective language, but also offers priceless insight on how to find and use the right words to get what you want out of life.
Fake News Detection in Social Media
Kasseropoulos Dimitrios-Panagiotis, SID 3308190011. Supervisor: Assoc. Prof. Christos Tjortjis; supervising committee members: Prof. Panagiotis Bozanis and Dr. Leonidas Akritidis. School of Science & Technology; a thesis submitted for the degree of Master of Science (MSc) in Data Science. January 2021, Thessaloniki, Greece.
Abstract This dissertation was written as part of the MSc in Data Science at the International Hellenic University. The easy propagation of, and access to, information on the web has the potential to become a serious issue when it comes to disinformation. The term "fake news" describes the intentional propagation of false news with the intent to mislead and harm the public, and it has gained more attention since the U.S. elections of 2016. Recent studies have used machine learning techniques to tackle it. This thesis reviews the style-based machine learning approach, which relies on the textual information of news, such as the manual extraction of lexical features from the text (e.g., part-of-speech counts), and tests the performance of both classic and non-classic (artificial neural network) algorithms. We managed to find a subset of best-performing linguistic features, using information-based metrics, which also tend to agree with the existing literature. We also combined the Named Entity Recognition (NER) functionality of the spaCy library with the FP-Growth algorithm to gain a deeper perspective on the named entities used in the two classes; a sketch of that combination appears below.
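The thesis excerpt does not show this step's code; the sketch below is one plausible reading of it, pairing spaCy NER with the FP-Growth implementation from mlxtend. The article texts are invented, and the support threshold is an arbitrary illustrative choice.

```python
import pandas as pd
import spacy
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

nlp = spacy.load("en_core_web_sm")

# Invented example articles standing in for the fake/real news corpora.
articles = [
    "Donald Trump met Vladimir Putin in Helsinki.",
    "Hillary Clinton campaigned in Ohio with Barack Obama.",
]

# One "transaction" per article: the set of named entities it mentions.
transactions = [sorted({ent.text for ent in nlp(text).ents}) for text in articles]

encoder = TransactionEncoder()
onehot = pd.DataFrame(
    encoder.fit(transactions).transform(transactions),
    columns=encoder.columns_,
)

# Frequent entity sets per class would be compared; min_support is arbitrary.
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```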
BROS: A Pre-Trained Language Model for Understanding Texts in Document
Under review as a conference paper at ICLR 2021. Anonymous authors; paper under double-blind review.
ABSTRACT Understanding documents from their visual snapshots is an emerging and challenging problem that requires both advanced computer vision and NLP methods. Although recent advances in OCR enable the accurate extraction of text segments, it is still challenging to extract key information from documents due to the diversity of layouts. To compensate for the difficulties, this paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), that represents and understands the semantics of spatially distributed texts. Different from previous pre-training methods on 1D text, BROS is pre-trained on large-scale semi-structured documents with a novel area-masking strategy while efficiently including the spatial layout information of input documents. Also, to generate structured outputs in various document understanding tasks, BROS utilizes a powerful graph-based decoder that can capture the relations between text segments. BROS achieves state-of-the-art results on four benchmark tasks: FUNSD, SROIE*, CORD, and SciTSR. Our experimental settings and implementation codes will be publicly available.
1 INTRODUCTION Document intelligence (DI), which understands industrial documents from their visual appearance, is a critical application of AI in business. One of the important challenges of DI is the key information extraction task (KIE) (Huang et al., 2019; Jaume et al., 2019; Park et al., 2019), which extracts structured information from documents such as financial reports, invoices, business emails, insurance quotes, and many others. The general idea of combining token embeddings with spatial-layout embeddings is sketched below.
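BROS's exact architecture is not given in this excerpt, so the following is only a generic illustration of the idea the abstract names: augmenting token embeddings with embeddings of each text segment's 2D position. The bucket count, dimensions, and coordinate scheme are all invented for the example; this is not BROS's actual implementation.

```python
import torch
import torch.nn as nn

class SpatialTextEmbedding(nn.Module):
    """Token embedding plus quantized bounding-box (x, y) embeddings."""

    def __init__(self, vocab_size=30522, dim=768, buckets=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.x = nn.Embedding(buckets, dim)  # quantized left coordinate
        self.y = nn.Embedding(buckets, dim)  # quantized top coordinate
        self.buckets = buckets

    def forward(self, token_ids, boxes):
        # boxes: (batch, seq, 2) with coordinates normalized to [0, 1].
        quant = (boxes * (self.buckets - 1)).long()
        return self.tok(token_ids) + self.x(quant[..., 0]) + self.y(quant[..., 1])

emb = SpatialTextEmbedding()
tokens = torch.tensor([[101, 2054, 102]])  # arbitrary token ids
boxes = torch.rand(1, 3, 2)                # random layout positions
print(emb(tokens, boxes).shape)            # torch.Size([1, 3, 768])
```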
Information Extraction Based on Named Entity for Tourism Corpus
Chantana Chantrapornchai and Aphisit Tunsakul, Dept. of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand. [email protected], [email protected]
Abstract Tourism information is scattered around nowadays. To search for the information, it is usually time-consuming to browse through the results from a search engine, then select and view the details of each accommodation. In this paper, we present a methodology to extract particular information from the full text returned by the search engine, to assist the users. The users can then look specifically at the desired relevant information. The approach can be used for the same task in other domains. The main steps are 1) building training data and 2) building a recognition model. First, the tourism data is gathered and the vocabularies are built. The raw corpus is used to train the vocabulary embedding; it is also used for creating annotated data.
The ontology is extracted based on the HTML web structure, and the corpus is based on WordNet. For these approaches, the time-consuming process is the annotation, which is to annotate the type of each named entity. In this paper, we target the tourism domain and aim to extract particular information that helps with ontology data acquisition. We present the framework for the given named entity extraction (a sketch of the extraction step follows below). Starting from the web information scraping process, the data are selected based on the HTML tags for corpus building. The data is then used for model creation for automatic named entity extraction.
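As a rough illustration of what such an extraction model does at inference time (the paper's own pipeline and trained model are not shown in this excerpt), the sketch below runs an off-the-shelf NER pipeline from Hugging Face transformers over a scraped snippet; the model name and the example sentence are stand-ins for the tourism-specific model and corpus the paper builds.

```python
from transformers import pipeline

# dslim/bert-base-NER is a publicly available BERT NER model, used here
# purely as a stand-in for the tourism-specific model the paper trains.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

snippet = "The Riverside Hotel in Bangkok is a five-minute walk from Kasetsart University."
for entity in ner(snippet):
    print(f'{entity["entity_group"]:>5}  {entity["score"]:.2f}  {entity["word"]}')
```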
NLP with BERT: Sentiment Analysis Using SAS® Deep Learning and DLPy
Paper SAS4429-2020. Doug Cairns and Xiangxiang Meng, SAS Institute Inc.
ABSTRACT A revolution is taking place in natural language processing (NLP) as a result of two ideas. The first idea is that pretraining a deep neural network as a language model is a good starting point for a range of NLP tasks. These networks can be augmented (layers can be added or dropped) and then fine-tuned with transfer learning for specific NLP tasks. The second idea involves a paradigm shift away from traditional recurrent neural networks (RNNs) and toward deep neural networks based on Transformer building blocks. One architecture that embodies these ideas is Bidirectional Encoder Representations from Transformers (BERT). BERT and its variants have been at or near the top of the leaderboard for many traditional NLP tasks, such as the general language understanding evaluation (GLUE) benchmarks. This paper provides an overview of BERT and shows how you can create your own BERT model by using SAS® Deep Learning and the SAS DLPy Python package. It illustrates the effectiveness of BERT by performing sentiment analysis on unstructured product reviews submitted to Amazon; a conceptually similar sketch with open-source tooling appears below.
INTRODUCTION Providing a computer-based analog for the conceptual and syntactic processing that occurs in the human brain for spoken or written communication has proven extremely challenging. As a simple example, consider the abstract for this (or any) technical paper. If well written, it should be a concise summary of what you will learn from reading the paper. As a reader, you expect to see some or all of the following:
• Technical context and/or problem
• Key contribution(s)
• Salient result(s)
If you were tasked to create a computer-based tool for summarizing papers, how would you translate your expectations as a reader into an implementable algorithm? This is the type of problem that the field of natural language processing (NLP) addresses.
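This excerpt stops before the paper's SAS DLPy code, so the sketch below substitutes the open-source Hugging Face transformers library to show the same end task, BERT-based sentiment scoring of product reviews. The reviews are invented, and the pipeline's default model stands in for the fine-tuned DLPy model the paper builds; this is not the paper's implementation.

```python
from transformers import pipeline

# Default English sentiment model; a stand-in for the paper's fine-tuned
# SAS DLPy BERT model, not a reproduction of it.
classifier = pipeline("sentiment-analysis")

reviews = [
    "This blender is fantastic and crushes ice effortlessly.",
    "The battery died after two days. Very disappointed.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(f'{result["label"]:>8} ({result["score"]:.2f}): {review}')
```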
Combatting White Nationalism: Lessons from Marx by Andrew Kliman
October 2, 2017 version.
Abstract This essay seeks to draw lessons from Karl Marx's writings and practice that can help combat Trumpism and other expressions of white nationalism. The foremost lesson is that fighting white nationalism in the tradition of Marx entails the perspective of solidarizing with the "white working class" by decisively defeating Trumpism and other far-right forces. Their defeat will help liberate the "white working class" from the grip of reaction and thereby spur the independent emancipatory self-development of working people as a whole. The anti-neoliberal "left" claims that Trump's electoral victory was due to an uprising of the "white working class" against a rapacious neoliberalism that has caused its income to stagnate for decades. However, working-class income (measured reasonably) did not stagnate, nor is "economic distress" a source of Trump's surprisingly strong support from "working-class" whites. And by looking back at the presidential campaigns of George Wallace between 1964 and 1972, it becomes clear that Trumpism is a manifestation of a long-standing white nationalist strain in US politics, not a response to neoliberalism or globalization. Marx differed from the anti-neoliberal "left" in that he sought to encourage the "independent movement of the workers" toward human emancipation, not further the political interests of "the left." Their different responses to white nationalism are rooted in this general difference. Detailed analysis of Marx's writings and activity around the US Civil War and the Irish independence struggle against England reveals that Marx stood for the defeat of the Confederacy, and the defeat of England, largely because he anticipated that these defeats would stimulate the independent emancipatory self-development of the working class.
Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election
Robert Faris, Hal Roberts, Bruce Etling, Nikki Bourassa, Ethan Zuckerman, and Yochai Benkler. 2017. Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election. Berkman Klein Center for Internet & Society Research Paper, August 2017. Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:33759251
ACKNOWLEDGMENTS This paper is the result of months of effort and has only come to be as a result of the generous input of many people from the Berkman Klein Center and beyond. Jonas Kaiser and Paola Villarreal expanded our thinking around methods and interpretation. Brendan Roach provided excellent research assistance. Rebekah Heacock Jones helped get this research off the ground, and Justin Clark helped bring it home. We are grateful to Gretchen Weber, David Talbot, and Daniel Dennis Jones for their assistance in the production and publication of this study. This paper has also benefited from contributions of many outside the Berkman Klein community. The entire Media Cloud team at the Center for Civic Media at MIT's Media Lab has been essential to this research.
Putin's New Russia
Edited by Jon Hellevig and Alexandre Latsa, with an introduction by Peter Lavelle. Contributors: Patrick Armstrong, Mark Chapman, Aleksandr Grishin, Jon Hellevig, Anatoly Karlin, Eric Kraus, Alexandre Latsa, Nils van der Vegte, Craig James Willy. Publisher: Kontinent USA, 1800 Connecticut Avenue, NW, Washington, DC 20009; [email protected]; www.us-russia.org/kontinent. First edition published September 2012. Copyright Jon Hellevig and Alexandre Latsa. Cover by Alexandra Mozilova, based on a painting by Ilya Komov. Printed at printing house "Citius". ISBN 978-0-9883137-0-5.
This is a book authored by independent-minded Western observers who have real experience of how Russia has developed after the failed perestroika, since Putin first became president in 2000. Common-sense warning: the book you are about to read is dangerous. If you are from the English-language media sphere, virtually everything you may think you know about contemporary Russia, its political system, leaders, economy, population, so-called opposition, foreign policy, and much more, is either seriously flawed or just plain wrong. This has not happened by accident. This book explains why. This book is also about gross double standards, hypocrisy, and venal stupidity, with Western media playing the role of willing accomplice. After reading this interesting tome, you might reconsider everything you "learn" from mainstream media about Russia and the world.
Unified Language Model Pre-training for Natural Language Understanding and Generation
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon. Microsoft Research. {lidong1,nanya,wenwan,fuwei}@microsoft.com, {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com
Abstract This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on (the mask construction is sketched below). UNILM compares favorably with BERT on the GLUE benchmark, and on the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.
1 Introduction Language model (LM) pre-training has substantially advanced the state of the art across a variety of natural language processing tasks [8, 29, 19, 31, 9, 1].
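The abstract's key mechanism, one shared Transformer whose self-attention masks differ per objective, can be made concrete with a small sketch. The three mask layouts below follow the abstract's description (bidirectional, left-to-right, and source-to-target); the NumPy encoding, with True meaning "may attend", is our own illustrative choice, not the paper's code.

```python
import numpy as np

def bidirectional_mask(n: int) -> np.ndarray:
    # Every token may attend to every other token (BERT-style).
    return np.ones((n, n), dtype=bool)

def unidirectional_mask(n: int) -> np.ndarray:
    # Token i may attend only to tokens 0..i (left-to-right LM).
    return np.tril(np.ones((n, n), dtype=bool))

def seq2seq_mask(src_len: int, tgt_len: int) -> np.ndarray:
    # Source tokens attend to the whole source; target tokens attend to
    # the whole source plus the already-generated part of the target.
    n = src_len + tgt_len
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :src_len] = True
    mask[src_len:, src_len:] = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
    return mask

print(seq2seq_mask(2, 3).astype(int))  # 5x5 mask for a 2-token source
```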
Question Answering by BERT
Question and Answering Using BERT. Suman Karanjit, Computer Science Major, Minnesota State University Moorhead. Link to GitHub: https://github.com/skaranjit/BertQA (an illustrative QA sketch appears below).
Contents: Abstract; Introduction; SQuAD; BERT Explained; What Is BERT?; Architecture; Input Processing; Getting Answer; Setting Up the Environment.
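The excerpt ends before the report's own code, but its topic is extractive QA with a SQuAD-fine-tuned BERT. The sketch below shows the general shape of that task using the Hugging Face pipeline API with a public SQuAD checkpoint; it may differ from the linked repository's actual implementation, and the context passage is invented.

```python
from transformers import pipeline

# A publicly available SQuAD-fine-tuned BERT; the repo's own checkpoint
# may differ.
qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

context = (
    "BERT was pretrained on BookCorpus and English Wikipedia using "
    "masked language modeling and next sentence prediction."
)
answer = qa(question="What corpora was BERT pretrained on?", context=context)
print(f'{answer["answer"]} (score {answer["score"]:.3f})')
```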
Automatic Question Answering: Beyond the Factoid
Radu Soricut, Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA 90292, USA, [email protected]. Eric Brill, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, [email protected]
Abstract In this paper we describe and evaluate a Question Answering system that goes beyond answering factoid questions. We focus on FAQ-like questions and answers, and build our system around a noisy-channel architecture which exploits both a language model for answers and a transformation model for answer/question terms, trained on a corpus of 1 million question/answer pairs collected from the Web. (The noisy-channel decomposition is sketched below.)
1 Introduction The Question Answering (QA) task has received a great deal of attention from the Computational Linguistics [... text missing in this excerpt ...] as readily be found by simply using a good search engine. It follows that there is a good economic incentive in moving the QA task to a more general level: it is likely that a system able to answer complex questions of the type people generally and/or frequently ask has greater potential impact than one restricted to answering only factoid questions. A natural move is to recast the question answering task to handling questions people frequently ask or want answers for, as seen in Frequently Asked Questions (FAQ) lists. These questions are sometimes factoid questions (such as, "What is Scotland's national costume?"), but in general are more complex questions (such as, "How does a film qualify for an Academy Award?", which requires an answer along the following lines: "A feature film must screen in a Los Angeles County theater in 35 or 70mm or in a 24-frame progressive scan digital format suitable for exhibiting in existing commercial digital cinema sites for paid admission for seven consecutive days.")
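The noisy-channel setup the abstract describes is the standard source-channel decomposition: rank candidate answers a for a question q by p(a) · p(q | a), an answer language model times a question-given-answer transformation model. The excerpt does not give the authors' exact formulation or models, so the sketch below uses toy stand-in scorers; every function here is a placeholder for illustration only.

```python
import math

def lm_logprob(answer: str) -> float:
    # Toy "answer language model": mildly favors longer answers.
    return -1.0 / max(len(answer.split()), 1)

def channel_logprob(question: str, answer: str) -> float:
    # Toy "transformation model": log of smoothed question-term overlap.
    q_terms = set(question.lower().split())
    a_terms = set(answer.lower().split())
    return math.log((len(q_terms & a_terms) + 1) / (len(q_terms) + 1))

def rank(question: str, candidates: list[str]) -> list[str]:
    # a* = argmax_a log p(a) + log p(q | a)
    return sorted(candidates,
                  key=lambda a: lm_logprob(a) + channel_logprob(question, a),
                  reverse=True)

print(rank("how does a film qualify for an academy award",
           ["a film must screen in a los angeles county theater",
            "scotland's national costume is the kilt"]))
```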