Predicting Sequences: Structured Perceptron

CS 6355: Structured Prediction

Conditional Random Fields summary
• An undirected graphical model
  – Decompose the score over the structure into a collection of factors
  – Each factor assigns a score to an assignment of the random variables it is connected to
• Training and prediction
  – Final prediction via $\arg\max_y w^\top \phi(x, y)$
  – Train by maximum (regularized) likelihood
• Connections to other models
  – Effectively a linear classifier
  – A generalization of logistic regression to structures
  – A conditional variant of a Markov Random Field
    • We will see this soon

Global features
The feature function decomposes over the sequence. (Figure: a chain over the labels $y_0, y_1, y_2, y_3$ with input $x$; the global feature function is the sum of local factors $\phi(x, y_0, y_1) + \phi(x, y_1, y_2) + \phi(x, y_2, y_3)$.)

Outline
• Sequence models
• Hidden Markov models
  – Inference with HMM
  – Learning
• Conditional models and local classifiers
• Global models
  – Conditional Random Fields
  – Structured Perceptron for sequences

HMM is also a linear classifier
Consider the HMM:
$$P(x, y) = \prod_i P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)$$
The first factor in the product gives the transitions, the second the emissions. Or equivalently,
$$\log P(x, y) = \sum_i \left[ \log P(y_i \mid y_{i-1}) + \log P(x_i \mid y_i) \right]$$
Log joint probability = transition scores + emission scores.
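To make the decomposition concrete, here is a minimal sketch (not from the slides) that scores a (word sequence, tag sequence) pair by summing log transition and log emission probabilities; the dictionary layout and the start symbol are illustrative assumptions.

```python
import math

def hmm_log_score(words, tags, trans, emit, start_tag="<s>"):
    """Log joint probability log P(x, y) under a first-order HMM:
    the sum of transition scores and emission scores.
    Assumes trans[s_prev][s] = P(s | s_prev) and emit[s][w] = P(w | s)."""
    score, prev = 0.0, start_tag
    for word, tag in zip(words, tags):
        score += math.log(trans[prev][tag])  # transition score: log P(y_i | y_{i-1})
        score += math.log(emit[tag][word])   # emission score:   log P(x_i | y_i)
        prev = tag
    return score
```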
Let us examine this expression using a carefully defined set of indicator functions:
$$\mathbb{1}_z = \begin{cases} 1, & z \text{ is true}, \\ 0, & z \text{ is false}. \end{cases}$$
Indicators are functions that map Booleans to 0 or 1.

The transition term $\log P(y_i \mid y_{i-1})$ is equivalent to
$$\sum_{s} \sum_{s'} \log P(s \mid s') \cdot \mathbb{1}_{[y_i = s]} \cdot \mathbb{1}_{[y_{i-1} = s']},$$
where $s$ and $s'$ range over the states. The indicators ensure that only one of the elements of the double summation is non-zero.

Likewise, the emission term $\log P(x_i \mid y_i)$ is equivalent to
$$\sum_{s} \log P(x_i \mid s) \cdot \mathbb{1}_{[y_i = s]}.$$
Again, the indicators ensure that only one of the elements of the summation is non-zero.

Substituting these into the log joint probability:
$$\log P(x, y) = \sum_i \sum_{s} \sum_{s'} \log P(s \mid s') \cdot \mathbb{1}_{[y_i = s]} \cdot \mathbb{1}_{[y_{i-1} = s']} + \sum_i \sum_{s} \log P(x_i \mid s) \cdot \mathbb{1}_{[y_i = s]}$$

Pushing the sums over positions $i$ inward:
$$\log P(x, y) = \sum_{s} \sum_{s'} \log P(s \mid s') \sum_i \mathbb{1}_{[y_i = s]} \mathbb{1}_{[y_{i-1} = s']} + \sum_{s} \sum_i \log P(x_i \mid s) \, \mathbb{1}_{[y_i = s]}$$

The inner sum $\sum_i \mathbb{1}_{[y_i = s]} \mathbb{1}_{[y_{i-1} = s']}$ is the number of times there is a transition in the sequence from state $s'$ to state $s$: $\text{Count}(s' \to s)$. Similarly, $\sum_i \mathbb{1}_{[y_i = s]}$ is the number of times state $s$ occurs in the sequence, $\text{Count}(s)$, so the second term keeps exactly the emission scores at those positions:
$$\log P(x, y) = \sum_{s} \sum_{s'} \log P(s \mid s') \cdot \text{Count}(s' \to s) + \sum_{s} \sum_{i :\, y_i = s} \log P(x_i \mid s)$$

This is a linear function: the $\log P$ terms are the weights, and the counts obtained via the indicators are the features. It can be written as $w^\top \phi(x, y)$, and we can add more features.
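As an illustration (not code from the lecture), the sketch below builds the indicator-count feature vector $\phi(x, y)$ and evaluates the linear score $w^\top \phi(x, y)$; the feature keys, the start symbol, and the sparse dictionary representation are assumptions.

```python
from collections import Counter

def hmm_features(words, tags, start_tag="<s>"):
    """phi(x, y): counts of transition and emission events in (words, tags)."""
    phi = Counter()
    prev = start_tag
    for word, tag in zip(words, tags):
        phi[("trans", prev, tag)] += 1  # contributes to Count(s' -> s)
        phi[("emit", tag, word)] += 1   # how often `tag` emits `word`
        prev = tag
    return phi

def linear_score(w, phi):
    """w^T phi(x, y): weighted sum over the active features."""
    return sum(w.get(f, 0.0) * count for f, count in phi.items())
```

If the weight of ("trans", s', s) is set to log P(s | s') and the weight of ("emit", s, x) to log P(x | s), then linear_score recovers the HMM log joint probability; nothing, however, forces the weights to be log probabilities.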
HMM is a linear classifier: An example

Consider the tag sequence Det Noun Verb Det Noun for the sentence "The dog ate the homework". Its log joint probability is the sum of transition scores and emission scores:

Transition scores:
  log P(Det → Noun) × 2
+ log P(Noun → Verb) × 1
+ log P(Verb → Det) × 1

Emission scores:
  log P(The | Det) × 1
+ log P(dog | Noun) × 1
+ log P(ate | Verb) × 1
+ log P(the | Det) × 1
+ log P(homework | Noun) × 1

The log probabilities are $w$, the parameters of the model; the counts (2, 1, 1, 1, 1, 1, 1, 1) are $\phi(x, y)$, the properties of this output and the input. So $\log P(x, y)$ is a linear scoring function: $\log P(x, y) = w^\top \phi(x, y)$.

Towards structured Perceptron
1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference
2. The Viterbi algorithm calculates $\max_y w^\top \phi(x, y)$
   – Viterbi only cares about scores of structures (not necessarily normalized); a decoding sketch follows this list
3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z (the partition function)
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model instead of the generative one!
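Here is the decoding sketch referred to in item 2: a minimal Viterbi decoder that computes $\arg\max_y w^\top \phi(x, y)$ under the transition/emission feature layout assumed in the earlier sketch (the helper names, tag set, and start symbol are illustrative, not from the lecture).

```python
def viterbi(words, tagset, w, start_tag="<s>"):
    """argmax over tag sequences of w^T phi(x, y), assuming phi decomposes into
    per-position transition and emission features as in hmm_features above."""
    def local(prev, tag, word):
        # score contribution of assigning `tag` at this position, coming from `prev`
        return w.get(("trans", prev, tag), 0.0) + w.get(("emit", tag, word), 0.0)

    # best[t] = (score of the best partial sequence ending in tag t, that sequence)
    best = {t: (local(start_tag, t, words[0]), [t]) for t in tagset}
    for word in words[1:]:
        new_best = {}
        for t in tagset:
            score, seq = max(
                ((best[p][0] + local(p, t, word), best[p][1]) for p in tagset),
                key=lambda pair: pair[0],
            )
            new_best[t] = (score, seq + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]
```

Note that the weights need not be normalized probabilities: Viterbi only needs scores, which is exactly what makes the structured perceptron possible.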
Structured Perceptron algorithm

Given a training set $D = \{(x, y)\}$:
1. Initialize $w = 0 \in \mathbb{R}^n$
2. For epoch = 1 … T:
   For each training example $(x, y) \in D$:
     – Predict $y' = \arg\max_{y'} w^\top \phi(x, y')$
     – If $y \neq y'$, apply the structured perceptron update: $w \leftarrow w + \text{learningRate} \, (\phi(x, y) - \phi(x, y'))$
3. Return $w$

Prediction: $\arg\max_y w^\top \phi(x, y)$
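Putting the pieces together, here is a compact sketch of the training loop, reusing the hypothetical hmm_features and viterbi helpers from the earlier sketches; the epoch count and learning rate are illustrative defaults.

```python
def train_structured_perceptron(data, tagset, epochs=10, learning_rate=1.0):
    """data: a list of (words, tags) training pairs. Returns the weight dict w."""
    w = {}                                               # 1. Initialize w = 0
    for _ in range(epochs):                              # 2. For epoch = 1 ... T
        for words, gold_tags in data:                    #    For each (x, y) in D
            pred_tags = viterbi(words, tagset, w)        #    y' = argmax w^T phi(x, y')
            if pred_tags != gold_tags:                   #    If y != y', update:
                gold = hmm_features(words, gold_tags)    #    w <- w + lr * (phi(x, y)
                pred = hmm_features(words, pred_tags)    #                 - phi(x, y'))
                for f in set(gold) | set(pred):
                    w[f] = w.get(f, 0.0) + learning_rate * (gold[f] - pred[f])
    return w
```

Unlike maximum-likelihood HMM training, nothing here normalizes the scores; the weights are trained directly so that the gold structure outscores the predicted one.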