Unsupervised Learning of Cross-Modal Mappings Between
Total Page:16
File Type:pdf, Size:1020Kb
Load more
										Recommended publications
									
								- 
												  Malware Classification with BERTSan Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 5-25-2021 Malware Classification with BERT Joel Lawrence Alvares Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects Part of the Artificial Intelligence and Robotics Commons, and the Information Security Commons Malware Classification with Word Embeddings Generated by BERT and Word2Vec Malware Classification with BERT Presented to Department of Computer Science San José State University In Partial Fulfillment of the Requirements for the Degree By Joel Alvares May 2021 Malware Classification with Word Embeddings Generated by BERT and Word2Vec The Designated Project Committee Approves the Project Titled Malware Classification with BERT by Joel Lawrence Alvares APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE San Jose State University May 2021 Prof. Fabio Di Troia Department of Computer Science Prof. William Andreopoulos Department of Computer Science Prof. Katerina Potika Department of Computer Science 1 Malware Classification with Word Embeddings Generated by BERT and Word2Vec ABSTRACT Malware Classification is used to distinguish unique types of malware from each other. This project aims to carry out malware classification using word embeddings which are used in Natural Language Processing (NLP) to identify and evaluate the relationship between words of a sentence. Word embeddings generated by BERT and Word2Vec for malware samples to carry out multi-class classification. BERT is a transformer based pre- trained natural language processing (NLP) model which can be used for a wide range of tasks such as question answering, paraphrase generation and next sentence prediction. However, the attention mechanism of a pre-trained BERT model can also be used in malware classification by capturing information about relation between each opcode and every other opcode belonging to a malware family.
- 
												  A Survey of Orthographic Information in Machine Translation 3Machine Translation Journal manuscript No. (will be inserted by the editor) A Survey of Orthographic Information in Machine Translation Bharathi Raja Chakravarthi1 ⋅ Priya Rani 1 ⋅ Mihael Arcan2 ⋅ John P. McCrae 1 the date of receipt and acceptance should be inserted later Abstract Machine translation is one of the applications of natural language process- ing which has been explored in different languages. Recently researchers started pay- ing attention towards machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine transla- tion systems is the variation in orthographic conventions which causes many issues to traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve the machine translation system. This article offers a survey of research regarding orthog- raphy’s influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and how orthographic in- formation can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made, and showing how orthographic knowledge improves the performance of machine translation of under- resourced languages. We discuss different types of machine translation and demon- strate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts of cog- nates information at different levels of machine translation and the lessons that can Bharathi Raja Chakravarthi [email protected] Priya Rani [email protected] Mihael Arcan [email protected] John P.
- 
												  Champollion: a Robust Parallel Text Sentence AlignerChampollion: A Robust Parallel Text Sentence Aligner Xiaoyi Ma Linguistic Data Consortium 3600 Market St. Suite 810 Philadelphia, PA 19104 [email protected] Abstract This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potential noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese – English parallel corpus show that Champollion achieves high precision and recall on noisy data. Champollion can be easily ported to new language pairs. It’s freely available to the public. framework to find the maximum likelihood alignment of 1. Introduction sentences. Parallel text is a very valuable resource for a number The length based approach works remarkably well on of natural language processing tasks, including machine language pairs with high length correlation, such as translation (Brown et al. 1993; Vogel and Tribble 2002; French and English. Its performance degrades quickly, Yamada and Knight 2001;), cross language information however, when the length correlation breaks down, such retrieval, and word disambiguation. as in the case of Chinese and English. Parallel text provides the maximum utility when it is Even with language pairs with high length correlation, sentence aligned. The sentence alignment process maps the Gale-Church algorithm may fail at regions that contain sentences in the source text to their translation. The labor many sentences with similar length. A number of intensive and time consuming nature of manual sentence algorithms, such as (Wu 1994), try to overcome the alignment makes large parallel text corpus development weaknesses of length based approaches by utilizing lexical difficult.
- 
												  Word Representation Or Word Embedding in Persian Texts Siamak Sarmady Erfan Rahmani1 Word representation or word embedding in Persian texts Siamak Sarmady Erfan Rahmani (Abstract) Text processing is one of the sub-branches of natural language processing. Recently, the use of machine learning and neural networks methods has been given greater consideration. For this reason, the representation of words has become very important. This article is about word representation or converting words into vectors in Persian text. In this research GloVe, CBOW and skip-gram methods are updated to produce embedded vectors for Persian words. In order to train a neural networks, Bijankhan corpus, Hamshahri corpus and UPEC corpus have been compound and used. Finally, we have 342,362 words that obtained vectors in all three models for this words. These vectors have many usage for Persian natural language processing. and a bit set in the word index. One-hot vector method 1. INTRODUCTION misses the similarity between strings that are specified by their characters [3]. The Persian language is a Hindi-European language The second method is distributed vectors. The words where the ordering of the words in the sentence is in the same context (neighboring words) may have subject, object, verb (SOV). According to various similar meanings (For example, lift and elevator both structures in Persian language, there are challenges that seem to be accompanied by locks such as up, down, cannot be found in languages such as English [1]. buildings, floors, and stairs). This idea influences the Morphologically the Persian is one of the hardest representation of words using their context. Suppose that languages for the purposes of processing and each word is represented by a v dimensional vector.
- 
												  The Role of Higher-Level Linguistic Features in HMM-Based Speech SynthesisINTERSPEECH 2010 The role of higher-level linguistic features in HMM-based speech synthesis Oliver Watts, Junichi Yamagishi, Simon King Centre for Speech Technology Research, University of Edinburgh, UK [email protected] [email protected] [email protected] Abstract the annotation of a synthesiser’s training data on listeners’ re- action to the speech produced by that synthesiser is still unclear We analyse the contribution of higher-level elements of the lin- to us. In particular, we suspect that features such as pitch ac- guistic specification of a data-driven speech synthesiser to the cent type which might be useful in voice-building if labelled naturalness of the synthetic speech which it generates. The reliably, are under-used in conventional systems because of la- system is trained using various subsets of the full feature-set, belling/prediction errors. For building the systems to be evalu- in which features relating to syntactic category, intonational ated we therefore used data for which hand labelled annotation phrase boundary, pitch accent and boundary tones are selec- of ToBI events is available. This allows us to compare ideal sys- tively removed. Utterances synthesised by the different config- tems built from a corpus where these higher-level features are urations of the system are then compared in a subjective evalu- accurately annotated with more conventional systems that rely ation of their naturalness. exclusively on prediction of these features from text for annota- The work presented forms background analysis for an on- tion of their training data. going set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be triv- 2.
- 
												  Learned in Speech Recognition: Contextual Acoustic Word EmbeddingsLEARNED IN SPEECH RECOGNITION: CONTEXTUAL ACOUSTIC WORD EMBEDDINGS Shruti Palaskar∗, Vikas Raunak∗ and Florian Metze Carnegie Mellon University, Pittsburgh, PA, U.S.A. fspalaska j vraunak j fmetze [email protected] ABSTRACT model [10, 11, 12] trained for direct Acoustic-to-Word (A2W) speech recognition [13]. Using this model, we jointly learn to End-to-end acoustic-to-word speech recognition models have re- automatically segment and classify input speech into individual cently gained popularity because they are easy to train, scale well to words, hence getting rid of the problem of chunking or requiring large amounts of training data, and do not require a lexicon. In addi- pre-defined word boundaries. As our A2W model is trained at the tion, word models may also be easier to integrate with downstream utterance level, we show that we can not only learn acoustic word tasks such as spoken language understanding, because inference embeddings, but also learn them in the proper context of their con- (search) is much simplified compared to phoneme, character or any taining sentence. We also evaluate our contextual acoustic word other sort of sub-word units. In this paper, we describe methods embeddings on a spoken language understanding task, demonstrat- to construct contextual acoustic word embeddings directly from a ing that they can be useful in non-transcription downstream tasks. supervised sequence-to-sequence acoustic-to-word speech recog- Our main contributions in this paper are the following: nition model using the learned attention distribution. On a suite 1. We demonstrate the usability of attention not only for aligning of 16 standard sentence evaluation tasks, our embeddings show words to acoustic frames without any forced alignment but also for competitive performance against a word2vec model trained on the constructing Contextual Acoustic Word Embeddings (CAWE).
- 
												  Language Resource Management System for Asian Wordnet Collaboration and Its Web Service ApplicationLanguage Resource Management System for Asian WordNet Collaboration and Its Web Service Application Virach Sornlertlamvanich1,2, Thatsanee Charoenporn1,2, Hitoshi Isahara3 1Thai Computational Linguistics Laboratory, NICT Asia Research Center, Thailand 2National Electronics and Computer Technology Center, Pathumthani, Thailand 3National Institute of Information and Communications Technology, Japan E-mail: [email protected], [email protected], [email protected] Abstract This paper presents the language resource management system for the development and dissemination of Asian WordNet (AWN) and its web service application. We develop the platform to establish a network for the cross language WordNet development. Each node of the network is designed for maintaining the WordNet for a language. Via the table that maps between each language WordNet and the Princeton WordNet (PWN), the Asian WordNet is realized to visualize the cross language WordNet between the Asian languages. We propose a language resource management system, called WordNet Management System (WNMS), as a distributed management system that allows the server to perform the cross language WordNet retrieval, including the fundamental web service applications for editing, visualizing and language processing. The WNMS is implemented on a web service protocol therefore each node can be independently maintained, and the service of each language WordNet can be called directly through the web service API. In case of cross language implementation, the synset ID (or synset offset) defined by PWN is used to determined the linkage between the languages. English PWN. Therefore, the original structural 1. Introduction information is inherited to the target WordNet through its sense translation and sense ID. The AWN finally Language resource becomes crucial in the recent needs for cross cultural information exchange.
- 
												  Lecture 26 Word Embeddings and Recurrent NetsCS447: Natural Language Processing http://courses.engr.illinois.edu/cs447 Where we’re at Lecture 25: Word Embeddings and neural LMs Lecture 26: Recurrent networks Lecture 26 Lecture 27: Sequence labeling and Seq2Seq Lecture 28: Review for the final exam Word Embeddings and Lecture 29: In-class final exam Recurrent Nets Julia Hockenmaier [email protected] 3324 Siebel Center CS447: Natural Language Processing (J. Hockenmaier) !2 What are neural nets? Simplest variant: single-layer feedforward net For binary Output unit: scalar y classification tasks: Single output unit Input layer: vector x Return 1 if y > 0.5 Recap Return 0 otherwise For multiclass Output layer: vector y classification tasks: K output units (a vector) Input layer: vector x Each output unit " yi = class i Return argmaxi(yi) CS447: Natural Language Processing (J. Hockenmaier) !3 CS447: Natural Language Processing (J. Hockenmaier) !4 Multi-layer feedforward networks Multiclass models: softmax(yi) We can generalize this to multi-layer feedforward nets Multiclass classification = predict one of K classes. Return the class i with the highest score: argmaxi(yi) Output layer: vector y In neural networks, this is typically done by using the softmax N Hidden layer: vector hn function, which maps real-valued vectors in R into a distribution … … … over the N outputs … … … … … …. For a vector z = (z0…zK): P(i) = softmax(zi) = exp(zi) ∕ ∑k=0..K exp(zk) Hidden layer: vector h1 (NB: This is just logistic regression) Input layer: vector x CS447: Natural Language Processing (J. Hockenmaier) !5 CS447: Natural Language Processing (J. Hockenmaier) !6 Neural Language Models LMs define a distribution over strings: P(w1….wk) LMs factor P(w1….wk) into the probability of each word: " P(w1….wk) = P(w1)·P(w2|w1)·P(w3|w1w2)·…· P(wk | w1….wk#1) A neural LM needs to define a distribution over the V words in Neural Language the vocabulary, conditioned on the preceding words.
- 
												  Self-Training Wavenet for TTS in Low-Data RegimesStrawNet: Self-Training WaveNet for TTS in Low-Data Regimes Manish Sharma, Tom Kenter, Rob Clark Google UK fskmanish, tomkenter, [email protected] Abstract is increased. However, it can be seen from their results that the quality degrades when the number of recordings is further Recently, WaveNet has become a popular choice of neural net- decreased. work to synthesize speech audio. Autoregressive WaveNet is To reduce the voice artefacts observed in WaveNet stu- capable of producing high-fidelity audio, but is too slow for dent models trained under a low-data regime, we aim to lever- real-time synthesis. As a remedy, Parallel WaveNet was pro- age both the high-fidelity audio produced by an autoregressive posed, which can produce audio faster than real time through WaveNet, and the faster-than-real-time synthesis capability of distillation of an autoregressive teacher into a feedforward stu- a Parallel WaveNet. We propose a training paradigm, called dent network. A shortcoming of this approach, however, is that StrawNet, which stands for “Self-Training WaveNet”. The key a large amount of recorded speech data is required to produce contribution lies in using high-fidelity speech samples produced high-quality student models, and this data is not always avail- by an autoregressive WaveNet to self-train first a new autore- able. In this paper, we propose StrawNet: a self-training ap- gressive WaveNet and then a Parallel WaveNet model. We refer proach to train a Parallel WaveNet. Self-training is performed to models distilled this way as StrawNet student models. using the synthetic examples generated by the autoregressive We evaluate StrawNet by comparing it to a baseline WaveNet teacher.
- 
												  The Web As a Parallel CorpusThe Web as a Parallel Corpus Philip Resnik∗ Noah A. Smith† University of Maryland Johns Hopkins University Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content- based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair. 1. Introduction Parallel corpora—bodies of text in parallel translation, also known as bitexts—have taken on an important role in machine translation and multilingual natural language processing. They represent resources for automatic lexical acquisition (e.g., Gale and Church 1991; Melamed 1997), they provide indispensable training data for statistical translation models (e.g., Brown et al. 1990; Melamed 2000; Och and Ney 2002), and they can provide the connection between vocabularies in cross-language information retrieval (e.g., Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997). More recently, researchers at Johns Hopkins University and the
- 
												  The RACAI Text-To-Speech Synthesis SystemThe RACAI Text-to-Speech Synthesis System Tiberiu Boroș, Radu Ion, Ștefan Daniel Dumitrescu Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy (RACAI) [email protected], [email protected], [email protected] of standalone Natural Language Processing (NLP) Tools Abstract aimed at enabling text-to-speech (TTS) synthesis for less- This paper describes the RACAI Text-to-Speech (TTS) entry resourced languages. A good example is the case of for the Blizzard Challenge 2013. The development of the Romanian, a language which poses a lot of challenges for TTS RACAI TTS started during the Metanet4U project and the synthesis mainly because of its rich morphology and its system is currently part of the METASHARE platform. This reduced support in terms of freely available resources. Initially paper describes the work carried out for preparing the RACAI all the tools were standalone, but their design allowed their entry during the Blizzard Challenge 2013 and provides a integration into a single package and by adding a unit selection detailed description of our system and future development speech synthesis module based on the Pitch Synchronous directions. Overlap-Add (PSOLA) algorithm we were able to create a fully independent text-to-speech synthesis system that we refer Index Terms: speech synthesis, unit selection, concatenative to as RACAI TTS. 1. Introduction A considerable progress has been made since the start of the project, but our system still requires development and Text-to-speech (TTS) synthesis is a complex process that although considered premature, our participation in the addresses the task of converting arbitrary text into voice.
- 
												  Unsupervised Speech Representation Learning Using Wavenet Autoencoders Jan Chorowski, Ron J1 Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being desired in several tasks, such as low resource automatic speech invariant to confounding low level details in the signal such as recognition (ASR), where only a small amount of labeled the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the to learn the acoustic model and a data representation in a fully kind of constraint that is applied to the latent representation. supervised manner [15], [16]. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately re- [17].