Auto-Encoding Dictionary Definitions Into Consistent Word Embeddings

Total Page:16

File Type:pdf, Size:1020Kb

Auto-Encoding Dictionary Definitions Into Consistent Word Embeddings Auto-Encoding Dictionary Definitions into Consistent Word Embeddings Tom Bosc Pascal Vincent Mila, Université de Montréal Mila, Université de Montréal, CIFAR, [email protected] Facebook AI Research [email protected] Abstract ilarity and especially in distinguishing it from re- latedness (Hill et al., 2016). Monolingual dictionaries are widespread and It is useful to specialize word embeddings to fo- semantically rich resources. This paper cus on either relation in order to improve perfor- presents a simple model that learns to com- mance on specific downstream tasks. For instance, pute word embeddings by processing dictio- Kiela et al.(2015) report that improvements on nary definitions and trying to reconstruct them. It exploits the inherent recursivity of dictio- relatedness benchmarks also yield improvements naries by encouraging consistency between the on document classification. In the other direction, representations it uses as inputs and the repre- embeddings learned by neural machine translation sentations it produces as outputs. The result- models capture similarity better than distributional ing embeddings are shown to capture seman- unsupervised objectives (Hill et al., 2014). tic similarity better than regular distributional There is a wealth of methods that postprocess methods and other dictionary-based methods. embeddings to improve or specialize them, such In addition, the method shows strong perfor- mance when trained exclusively on dictionary as retrofitting (Faruqui et al., 2014). On similarity data and generalizes in one shot. benchmarks, they are able to reach correlation co- efficients close to inter-annotator agreements. But 1 Introduction these methods rely on additional resources such as paraphrase databases (Wieting et al., 2016) or Dense, low-dimensional, real-valued vector repre- graphs of lexical relations such as synonymy, hy- sentations of words known as word embeddings pernymy, and their converse (Mrkšic´ et al., 2017). have proven very useful for NLP tasks (Turian Rather than relying on such curated lexical re- et al., 2010). They can be learned as a by-product sources that are not readily available for the ma- of solving a particular task (Collobert et al., 2011). jority of languages, we propose a method capa- Alternatively, one can pretrain generic embed- ble of improving embeddings by leveraging the dings based on co-occurence counts or using an more common resource of monolingual dictionar- unsupervised criterion such as predicting nearby ies.1 Lexical databases such as WordNet (Fell- words (Bengio et al., 2003; Mikolov et al., 2013). baum, 1998) are often built from dictionary defi- These methods implicitly rely on the distributional nitions, as was proposed earlier by Amsler(1980). hypothesis (Harris, 1954; Sahlgren, 2008), which We propose to bypass the process of explicitly states that words that occur in similar contexts tend building a lexical database – during which infor- to have similar meanings. mation is structured but information is also lost – It is common to study the relationships captured and instead directly use its detailed source: dictio- by word representations in terms of either simi- nary definitions. The goal is to obtain better rep- larity or relatedness (Hill et al., 2016). “Coffee” resentations for more languages with less effort. is related to “cup” as coffee is a beverage often The ability to process new definitions is also de- drunk in a cup, but “coffee” is not similar to “cup” sirable for future natural language understanding in that coffee is a beverage and cup is a container. systems. In a dialogue, a human might want to ex- Methods relying on the distributional hypothesis plain a new term by explaining it in his own words, often capture relatedness very well, reaching hu- 1See AppendixA for a list of online monolingual dictio- man performance, but fare worse in capturing sim- naries. and the chatbot should understand it. Similarly, hidden states output of LSTM embeddings question-answering systems should also be able to definition grasp definitions of technical terms that often oc- embedding cur in scientific writing. We expect the embedding of a word to rep- resent its meaning compactly. For interpretabil- ity purposes, it would be desirable to be able to min consistency penalty generate a definition from that embedding, as a simple way to verify what information it captured. Case language in point: to analyse word embeddings, Noraset model et al.(2017) used RNNs to produce definitions input from pretrained embeddings, manually annotated embeddings the errors in the generated definitions, and found min reconstruction out that more than half of the wrong definitions error fit either the antonyms of the defined words, their definition hypernyms, or related but different words. This prove: provide evidence for points in the same direction as the results of in- trinsic evaluations of word embeddings: lexical Figure 1: Overview of the CPAE model. relationships such as lexical entailment, similar- ity and relatedness are conflated in these embed- training criterion is built on the following princi- dings. It also suggests a new criterion for evalu- ple: we want the model to be able to recover the ating word representations, or even learning them: definition from which the representation was built. they should contain the necessary information to This objective should produce similar embeddings reproduce their definition (to some degree). for words which have similar definitions. Our hy- In this work, we propose a simple model that pothesis is that this will help capture semantic sim- exploits this criterion. The model consists of a def- ilarity, as opposed to relatedness. Reusing the pre- inition autoencoder: an LSTM processes the def- vious example, “coffee” and “cup” should get dif- inition of a word to yield its corresponding word ferent representations in virtue of having very dif- embedding. Given this embedding, the decoder at- ferent definitions, while “coffee” and “tea” should tempts to reconstruct the bag-of-words represen- get similar representations as they are both defined tation of the definition. Importantly, to address as beverages and plants. and leverage the recursive nature of dictionaries We chose to compute a single embedding per – the fact that words that are used inside a def- word in order to avoid having to disambiguate inition have their own associated definition – we word senses. Indeed, word sense disambiguation train this model with a consistency penalty that remains a challenging open problem with mixed ensures proximity of the embeddings produced by success on downstream task applications (Navigli, the LSTM and those that are used by the LSTM. 2009). Also, recent papers have shown that a sin- Our approach is self-contained: it yields good gle word vector can capture polysemy and that representations when trained on nothing but dic- having several vectors per word is not strictly nec- tionary data. Alternatively, it can also leverage essary (Li and Jurafsky, 2015)(Yaghoobzadeh and existing word embeddings and is then especially Schütze, 2016). Thus, when a word has several apt at specializing them for the similarity relation. definitions, we concatenate them to produce a sin- Finally, it is also extremely data-efficient, as it per- gle embedding. mits to create representations of new words in one shot from a short definition. 2.2 Autoencoder model 2 Model Let VD be the set of all words that are used in definitions and VK the set of all words that are 2.1 Setting and motivation K defined. We let w 2 V be a word and Dw = We suppose that we have access to a dictionary (Dw;1;:::;Dw;T ) be its definition, where Dw;t is that maps words to one or several definitions. Def- the index of a word in vocabulary VD. We en- initions themselves are sequences of words. Our code such a definition Dw by processing it with an LSTM (Hochreiter and Schmidhuber, 1997). scheme that brings the input embeddings closer The LSTM is parameterized by Ω and a matrix to the definition embeddings. We call this term D th E of size jV j × m, whose i row Ei contains an a consistency penalty because its goal is to to en- m-dimensional input embedding for the ith word sure that the embeddings used by the encoder (in- of VD. These input embeddings can either be put embeddings) and the embeddings produced by learned by the model or be fixed to a pretrained the encoder (definition embeddings) are consistent embedding. The last hidden state computed by with each other. It is implemented as this LSTM is then transformed linearly to yield an X m-dimensional definition embedding h. Thus the Jp(θ) = d(Ew; fθ(Dw)) encoder whose parameters are θ = fE; Ω; W; bg w2VD\VK computes this embedding h as where d is a distance. In our experiments, we choose d to be the Euclidian distance. The penalty h = f (D ) = W LSTM (D ) + b: θ w E;Ω w is only applied to some words because VD 6= VK. The subsequent decoder can be seen as a condi- Indeed, some words are defined but are not used in tional language model trained by maximum likeli- definitions and some words are used in definitions but not defined. In particular, inflected words are hood to regenerate definition Dw given definition not defined. To balance the two terms, we intro- embedding h = fθ(Dw). We use a simple con- ditional unigram model with a linear parametriza- duce two hyperparameters λ, α ≥ 0 and the com- tion θ0 = fE0; b0g where E0 is a jVDj × m matrix plete objective is b0 2 and is a bias vector. 0 0 We maximize the log-probability of definition J(θ ; θ) = αJr(θ ; θ) + λJp(θ): Dw under that model: We call the model CPAE, for Consistency Pe- nalized AutoEncoder when α > 0 and λ > 0 (see X Figure1).
Recommended publications
  • Malware Classification with BERT
    San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 5-25-2021 Malware Classification with BERT Joel Lawrence Alvares Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects Part of the Artificial Intelligence and Robotics Commons, and the Information Security Commons Malware Classification with Word Embeddings Generated by BERT and Word2Vec Malware Classification with BERT Presented to Department of Computer Science San José State University In Partial Fulfillment of the Requirements for the Degree By Joel Alvares May 2021 Malware Classification with Word Embeddings Generated by BERT and Word2Vec The Designated Project Committee Approves the Project Titled Malware Classification with BERT by Joel Lawrence Alvares APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE San Jose State University May 2021 Prof. Fabio Di Troia Department of Computer Science Prof. William Andreopoulos Department of Computer Science Prof. Katerina Potika Department of Computer Science 1 Malware Classification with Word Embeddings Generated by BERT and Word2Vec ABSTRACT Malware Classification is used to distinguish unique types of malware from each other. This project aims to carry out malware classification using word embeddings which are used in Natural Language Processing (NLP) to identify and evaluate the relationship between words of a sentence. Word embeddings generated by BERT and Word2Vec for malware samples to carry out multi-class classification. BERT is a transformer based pre- trained natural language processing (NLP) model which can be used for a wide range of tasks such as question answering, paraphrase generation and next sentence prediction. However, the attention mechanism of a pre-trained BERT model can also be used in malware classification by capturing information about relation between each opcode and every other opcode belonging to a malware family.
    [Show full text]
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018 TRAINING AUTOENCODERS BY ALTERNATING MINI- MIZATION Anonymous authors Paper under double-blind review ABSTRACT We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimiza- tion problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
    [Show full text]
  • Turbo Autoencoder: Deep Learning Based Channel Codes for Point-To-Point Communication Channels
    Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels Hyeji Kim Himanshu Asnani Yihan Jiang Samsung AI Center School of Technology ECE Department Cambridge and Computer Science University of Washington Cambridge, United Tata Institute of Seattle, United States Kingdom Fundamental Research [email protected] [email protected] Mumbai, India [email protected] Sewoong Oh Pramod Viswanath Sreeram Kannan Allen School of ECE Department ECE Department Computer Science & University of Illinois at University of Washington Engineering Urbana Champaign Seattle, United States University of Washington Illinois, United States [email protected] Seattle, United States [email protected] [email protected] Abstract Designing codes that combat the noise in a communication medium has remained a significant area of research in information theory as well as wireless communica- tions. Asymptotically optimal channel codes have been developed by mathemati- cians for communicating under canonical models after over 60 years of research. On the other hand, in many non-canonical channel settings, optimal codes do not exist and the codes designed for canonical models are adapted via heuristics to these channels and are thus not guaranteed to be optimal. In this work, we make significant progress on this problem by designing a fully end-to-end jointly trained neural encoder and decoder, namely, Turbo Autoencoder (TurboAE), with the following contributions: (a) under moderate block lengths, TurboAE approaches state-of-the-art performance under canonical channels; (b) moreover, TurboAE outperforms the state-of-the-art codes under non-canonical settings in terms of reliability. TurboAE shows that the development of channel coding design can be automated via deep learning, with near-optimal performance.
    [Show full text]
  • Double Backpropagation for Training Autoencoders Against Adversarial Attack
    1 Double Backpropagation for Training Autoencoders against Adversarial Attack Chengjin Sun, Sizhe Chen, and Xiaolin Huang, Senior Member, IEEE Abstract—Deep learning, as widely known, is vulnerable to adversarial samples. This paper focuses on the adversarial attack on autoencoders. Safety of the autoencoders (AEs) is important because they are widely used as a compression scheme for data storage and transmission, however, the current autoencoders are easily attacked, i.e., one can slightly modify an input but has totally different codes. The vulnerability is rooted the sensitivity of the autoencoders and to enhance the robustness, we propose to adopt double backpropagation (DBP) to secure autoencoder such as VAE and DRAW. We restrict the gradient from the reconstruction image to the original one so that the autoencoder is not sensitive to trivial perturbation produced by the adversarial attack. After smoothing the gradient by DBP, we further smooth the label by Gaussian Mixture Model (GMM), aiming for accurate and robust classification. We demonstrate in MNIST, CelebA, SVHN that our method leads to a robust autoencoder resistant to attack and a robust classifier able for image transition and immune to adversarial attack if combined with GMM. Index Terms—double backpropagation, autoencoder, network robustness, GMM. F 1 INTRODUCTION N the past few years, deep neural networks have been feature [9], [10], [11], [12], [13], or network structure [3], [14], I greatly developed and successfully used in a vast of fields, [15]. such as pattern recognition, intelligent robots, automatic Adversarial attack and its defense are revolving around a control, medicine [1]. Despite the great success, researchers small ∆x and a big resulting difference between f(x + ∆x) have found the vulnerability of deep neural networks to and f(x).
    [Show full text]
  • Learned in Speech Recognition: Contextual Acoustic Word Embeddings
    LEARNED IN SPEECH RECOGNITION: CONTEXTUAL ACOUSTIC WORD EMBEDDINGS Shruti Palaskar∗, Vikas Raunak∗ and Florian Metze Carnegie Mellon University, Pittsburgh, PA, U.S.A. fspalaska j vraunak j fmetze [email protected] ABSTRACT model [10, 11, 12] trained for direct Acoustic-to-Word (A2W) speech recognition [13]. Using this model, we jointly learn to End-to-end acoustic-to-word speech recognition models have re- automatically segment and classify input speech into individual cently gained popularity because they are easy to train, scale well to words, hence getting rid of the problem of chunking or requiring large amounts of training data, and do not require a lexicon. In addi- pre-defined word boundaries. As our A2W model is trained at the tion, word models may also be easier to integrate with downstream utterance level, we show that we can not only learn acoustic word tasks such as spoken language understanding, because inference embeddings, but also learn them in the proper context of their con- (search) is much simplified compared to phoneme, character or any taining sentence. We also evaluate our contextual acoustic word other sort of sub-word units. In this paper, we describe methods embeddings on a spoken language understanding task, demonstrat- to construct contextual acoustic word embeddings directly from a ing that they can be useful in non-transcription downstream tasks. supervised sequence-to-sequence acoustic-to-word speech recog- Our main contributions in this paper are the following: nition model using the learned attention distribution. On a suite 1. We demonstrate the usability of attention not only for aligning of 16 standard sentence evaluation tasks, our embeddings show words to acoustic frames without any forced alignment but also for competitive performance against a word2vec model trained on the constructing Contextual Acoustic Word Embeddings (CAWE).
    [Show full text]
  • Lecture 26 Word Embeddings and Recurrent Nets
    CS447: Natural Language Processing http://courses.engr.illinois.edu/cs447 Where we’re at Lecture 25: Word Embeddings and neural LMs Lecture 26: Recurrent networks Lecture 26 Lecture 27: Sequence labeling and Seq2Seq Lecture 28: Review for the final exam Word Embeddings and Lecture 29: In-class final exam Recurrent Nets Julia Hockenmaier [email protected] 3324 Siebel Center CS447: Natural Language Processing (J. Hockenmaier) !2 What are neural nets? Simplest variant: single-layer feedforward net For binary Output unit: scalar y classification tasks: Single output unit Input layer: vector x Return 1 if y > 0.5 Recap Return 0 otherwise For multiclass Output layer: vector y classification tasks: K output units (a vector) Input layer: vector x Each output unit " yi = class i Return argmaxi(yi) CS447: Natural Language Processing (J. Hockenmaier) !3 CS447: Natural Language Processing (J. Hockenmaier) !4 Multi-layer feedforward networks Multiclass models: softmax(yi) We can generalize this to multi-layer feedforward nets Multiclass classification = predict one of K classes. Return the class i with the highest score: argmaxi(yi) Output layer: vector y In neural networks, this is typically done by using the softmax N Hidden layer: vector hn function, which maps real-valued vectors in R into a distribution … … … over the N outputs … … … … … …. For a vector z = (z0…zK): P(i) = softmax(zi) = exp(zi) ∕ ∑k=0..K exp(zk) Hidden layer: vector h1 (NB: This is just logistic regression) Input layer: vector x CS447: Natural Language Processing (J. Hockenmaier) !5 CS447: Natural Language Processing (J. Hockenmaier) !6 Neural Language Models LMs define a distribution over strings: P(w1….wk) LMs factor P(w1….wk) into the probability of each word: " P(w1….wk) = P(w1)·P(w2|w1)·P(w3|w1w2)·…· P(wk | w1….wk#1) A neural LM needs to define a distribution over the V words in Neural Language the vocabulary, conditioned on the preceding words.
    [Show full text]
  • Artificial Intelligence Applied to Electromechanical Monitoring, A
    ARTIFICIAL INTELLIGENCE APPLIED TO ELECTROMECHANICAL MONITORING, A PERFORMANCE ANALYSIS Erasmus project Authors: Staš Osterc Mentor: Dr. Miguel Delgado Prieto, Dr. Francisco Arellano Espitia 1/5/2020 P a g e II ANNEX VI – DECLARACIÓ D’HONOR P a g e II I declare that, the work in this Master Thesis / Degree Thesis (choose one) is completely my own work, no part of this Master Thesis / Degree Thesis (choose one) is taken from other people’s work without giving them credit, all references have been clearly cited, I’m authorised to make use of the company’s / research group (choose one) related information I’m providing in this document (select when it applies). I understand that an infringement of this declaration leaves me subject to the foreseen disciplinary actions by The Universitat Politècnica de Catalunya - BarcelonaTECH. ___________________ __________________ ___________ Student Name Signature Date Title of the Thesis : _________________________________________________ ________________________________________________________________ ________________________________________________________________ ________________________________________________________________ P a g e II Contents Introduction........................................................................................................................................5 Abstract ..........................................................................................................................................5 Aim .................................................................................................................................................6
    [Show full text]
  • Unsupervised Speech Representation Learning Using Wavenet Autoencoders Jan Chorowski, Ron J
    1 Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being desired in several tasks, such as low resource automatic speech invariant to confounding low level details in the signal such as recognition (ASR), where only a small amount of labeled the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the to learn the acoustic model and a data representation in a fully kind of constraint that is applied to the latent representation. supervised manner [15], [16]. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately re- [17].
    [Show full text]
  • Arxiv:2007.00183V2 [Eess.AS] 24 Nov 2020
    WHOLE-WORD SEGMENTAL SPEECH RECOGNITION WITH ACOUSTIC WORD EMBEDDINGS Bowen Shi, Shane Settle, Karen Livescu TTI-Chicago, USA fbshi,settle.shane,[email protected] ABSTRACT Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length seg- ments of frames. We consider segmental models for whole- word (“acoustic-to-word”) speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi de- coding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we inves- tigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embed- dings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional Fig. 1. Whole-word segmental model for speech recognition. (smaller) gains can be obtained by pre-training the word pre- Note: boundary frames are not shared. diction layer with AGWEs. Our final models improve over segmental models, where the sequence probability is com- prior A2W models. puted based on segment scores instead of frame probabilities. Index Terms— speech recognition, segmental model, Segmental models have a long history in speech recognition acoustic-to-word, acoustic word embeddings, pre-training research, but they have been used primarily for phonetic recog- nition or as phone-level acoustic models [11–18].
    [Show full text]
  • Overview of Lstms and Word2vec and a Bit About Compositional Distributional Semantics If There’S Time
    Overview of LSTMs and word2vec and a bit about compositional distributional semantics if there’s time Ann Copestake Computer Laboratory University of Cambridge November 2016 Outline RNNs and LSTMs Word2vec Compositional distributional semantics Some slides adapted from Aurelie Herbelot. Outline. RNNs and LSTMs Word2vec Compositional distributional semantics Motivation I Standard NNs cannot handle sequence information well. I Can pass them sequences encoded as vectors, but input vectors are fixed length. I Models are needed which are sensitive to sequence input and can output sequences. I RNN: Recurrent neural network. I Long short term memory (LSTM): development of RNN, more effective for most language applications. I More info: http://neuralnetworksanddeeplearning.com/ (mostly about simpler models and CNNs) https://karpathy.github.io/2015/05/21/rnn-effectiveness/ http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Sequences I Video frame categorization: strict time sequence, one output per input. I Real-time speech recognition: strict time sequence. I Neural MT: target not one-to-one with source, order differences. I Many language tasks: best to operate left-to-right and right-to-left (e.g., bi-LSTM). I attention: model ‘concentrates’ on part of input relevant at a particular point. Caption generation: treat image data as ordered, align parts of image with parts of caption. Recurrent Neural Networks http://colah.github.io/posts/ 2015-08-Understanding-LSTMs/ RNN language model: Mikolov et al, 2010 RNN as a language model I Input vector: vector for word at t concatenated to vector which is output from context layer at t − 1. I Performance better than n-grams but won’t capture ‘long-term’ dependencies: She shook her head.
    [Show full text]
  • Feedforward Neural Networks and Word Embeddings
    Feedforward Neural Networks and Word Embeddings Fabienne Braune1 1LMU Munich May 14th, 2017 Fabienne Braune (CIS) Feedforward Neural Networks and Word Embeddings May 14th, 2017 · 1 Outline 1 Linear models 2 Limitations of linear models 3 Neural networks 4 A neural language model 5 Word embeddings Fabienne Braune (CIS) Feedforward Neural Networks and Word Embeddings May 14th, 2017 · 2 Linear Models Fabienne Braune (CIS) Feedforward Neural Networks and Word Embeddings May 14th, 2017 · 3 Binary Classification with Linear Models Example: the seminar at < time > 4 pm will Classification task: Do we have an < time > tag in the current position? Word Lemma LexCat Case SemCat Tag the the Art low seminar seminar Noun low at at Prep low stime 4 4 Digit low pm pm Other low timeid will will Verb low Fabienne Braune (CIS) Feedforward Neural Networks and Word Embeddings May 14th, 2017 · 4 Feature Vector Encode context into feature vector: 1 bias term 1 2 -3 lemma the 1 3 -3 lemma giraffe 0 ... ... ... 102 -2 lemma seminar 1 103 -2 lemma giraffe 0 ... ... ... 202 -1 lemma at 1 203 -1 lemma giraffe 0 ... ... ... 302 +1 lemma 4 1 303 +1 lemma giraffe 0 ... ... ... Fabienne Braune (CIS) Feedforward Neural Networks and Word Embeddings May 14th, 2017 · 5 Dot product with (initial) weight vector 2 3 2 3 x0 = 1 w0 = 1:00 6 x = 1 7 6 w = 0:01 7 6 1 7 6 1 7 6 x = 0 7 6 w = 0:01 7 6 2 7 6 2 7 6 ··· 7 6 ··· 7 6 7 6 7 6x = 17 6x = 0:017 6 101 7 6 101 7 6 7 6 7 6x102 = 07 6x102 = 0:017 T 6 7 6 7 h(X )= X · Θ X = 6 ··· 7 Θ = 6 ··· 7 6 7 6 7 6x201 = 17 6x201 = 0:017 6 7
    [Show full text]
  • An Unsupervised Method for OCR Post-Correction and Spelling
    An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish Quan Duong,♣ Mika Hämäläinen,♣,♦ Simon Hengchen,♠ firstname.lastname@{helsinki.fi;gu.se} ♣University of Helsinki, ♦Rootroo Ltd, ♠Språkbanken Text – University of Gothenburg Abstract Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character- based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub1 and Zenodo2. 1 Introduction As many people dealing with digital humanities study historical data, the problem that researchers continue to face is the quality of their digitized corpora. There have been large scale projects in the past focusing on digitizing parts of cultural history using OCR models available in those times (see Benner (2003), Bremer-Laamanen (2006)). While OCR methods have substantially developed and im- proved in recent times, it is still not always possible to re-OCR text. Re-OCRing documents is also not a priority when not all collections are digitised, as is the case in many national libraries.
    [Show full text]