Conditional Random Fields

Sargur N. Srihari
[email protected]
Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html

Outline
1. Generative and Discriminative Models
2. Classifiers: Naïve Bayes and Logistic Regression
3. Sequential Models: HMM and CRF, Markov Random Field
4. CRF vs HMM performance comparison
   – NLP: table extraction, POS tagging, shallow parsing, document analysis
5. CRFs in Computer Vision
6. Summary
7. References

Methods for Classification
• Generative Models (two-step)
  – Infer the class-conditional densities p(x|C_k) and priors p(C_k)
  – Then use Bayes' theorem to determine the posterior probabilities:
    p(C_k|x) = p(x|C_k) p(C_k) / p(x)
• Discriminative Models (one-step)
  – Directly infer the posterior probabilities p(C_k|x)
• Decision Theory
  – In both cases, use decision theory to assign each new x to a class

Generative Classifier
• Given M variables x = (x_1,..,x_M), a class variable y and the joint distribution p(x,y), we can
  – Marginalize: p(y) = Σ_x p(x,y)
  – Condition: p(y|x) = p(x,y) / p(x)
• By conditioning the joint pdf we form a classifier
• Huge need for samples
  – If the x_i are binary, we need on the order of 2^M values to specify p(x,y)
  – If M = 10 and there are two classes, that is 2 × 2^10 = 2048 probabilities; at 100 samples per probability we would need about 204,800 samples

Classification of ML Methods
• Generative Methods
  – "Generative" since sampling can generate synthetic data points
  – Popular models: Naïve Bayes, mixtures of multinomials; mixtures of Gaussians, hidden Markov models; Bayesian networks, Markov random fields
• Discriminative Methods
  – Focus on the given task, giving better performance
  – Popular models: logistic regression, SVMs; traditional neural networks, nearest neighbor; conditional random fields (CRFs)

Generative-Discriminative Pairs
• Naïve Bayes and logistic regression form a generative-discriminative pair
• Their relationship mirrors that between HMMs and linear-chain CRFs

Graphical Model Relationship
• [Figure: a 2×2 diagram relating the four models]
  – Generative models: the Naïve Bayes classifier, p(y,x), extends along a sequence to the hidden Markov model, p(Y,X)
  – Conditioning each generative model gives its discriminative counterpart: Naïve Bayes → logistic regression, p(y|x); HMM → conditional random field, p(Y|X)

Naïve Bayes Classifier
• Goal is to predict a single class variable y given a vector of features x = (x_1,..,x_M)
• Assume that once the class label is known, the features are independent
• The joint probability model has the form
  p(y,x) = p(y) Π_{m=1}^M p(x_m|y)
  – Need to estimate only M probabilities
• A factor graph is obtained by defining the factors ψ(y) = p(y) and ψ_m(y, x_m) = p(x_m|y)
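A minimal NumPy sketch of the two-step generative route just described: estimate p(y) and the Naïve Bayes class-conditionals p(x_m|y) from data, then apply Bayes' theorem to obtain p(y|x). The toy data, the Laplace smoothing, and the function names are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Toy data: N samples of M binary features with binary class labels.
# (Illustrative values only.)
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])  # shape (N, M)
y = np.array([1, 1, 0, 0])                                   # shape (N,)
classes = np.unique(y)

# Step 1 (generative): estimate the prior p(y) and class-conditionals p(x_m = 1 | y).
# Laplace smoothing avoids zero probabilities with so few samples.
priors = np.array([(y == c).mean() for c in classes])
cond = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                 for c in classes])

def posterior(x_new):
    """Step 2: Bayes' theorem, p(C_k|x) = p(x|C_k) p(C_k) / p(x)."""
    # Naive Bayes assumption: p(x|y) factorizes over the M features.
    likelihood = np.prod(np.where(x_new == 1, cond, 1.0 - cond), axis=1)
    joint = likelihood * priors          # p(x, y = c) for each class c
    return joint / joint.sum()           # normalize by p(x)

print(posterior(np.array([1, 0, 1])))    # posterior over the two classes
```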
Logistic Regression Classifier
• Feature vector x; two-class classification with class variable y taking values C_1 and C_2
• The posterior probability p(C_1|x) is written as
  p(C_1|x) = f(x) = σ(w^T x)   where   σ(a) = 1 / (1 + exp(-a))   is the logistic sigmoid
• Known as logistic regression in statistics
  – Although it is a model for classification rather than for regression
• Properties of the logistic sigmoid:
  A. Symmetry: σ(-a) = 1 - σ(a)
  B. Inverse: a = ln(σ / (1 - σ)), known as the logit; also called the log odds, since it is the ratio ln[p(C_1|x) / p(C_2|x)]
  C. Derivative: dσ/da = σ(1 - σ)

Relationship between Logistic Regression and Generative Classifier
• The posterior probability of the class variable y is
  p(C_1|x) = p(x|C_1) p(C_1) / [p(x|C_1) p(C_1) + p(x|C_2) p(C_2)]
           = 1 / (1 + exp(-a)) = σ(a)   where   a = ln[ p(x|C_1) p(C_1) / (p(x|C_2) p(C_2)) ]
• In a generative model we estimate the class-conditionals (which are used to determine a)
• In the discriminative approach we directly estimate a as a linear function of x, i.e., a = w^T x

Logistic Regression Parameters
• With M variables, logistic regression has M parameters, w = (w_1,..,w_M)
• By contrast, the generative approach of fitting Gaussian class-conditional densities results in 2M parameters for the means, M(M+1)/2 parameters for the shared covariance matrix, and one for the class prior p(C_1)
  – This can be reduced to O(M) parameters by assuming independence via Naïve Bayes

Learning Logistic Parameters
• For a data set {x_n, t_n} with t_n ∈ {0,1}, the likelihood is
  p(t|w) = Π_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}
  where t = (t_1,..,t_N) and y_n = p(C_1|x_n)
• The parameter w specifies the y_n as follows: y_n = σ(a_n) with a_n = w^T x_n
• Defining the cross-entropy error function
  E(w) = -ln p(t|w) = -Σ_{n=1}^N { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
• Taking the gradient with respect to w:
  ∇E(w) = Σ_{n=1}^N (y_n - t_n) x_n
• Same form as the sum-of-squares error gradient for linear regression
  – Can use a sequential algorithm where samples are presented one at a time, using
    w^(τ+1) = w^(τ) - η ∇E_n
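A minimal sketch, assuming NumPy, of fitting binary logistic regression with the quantities defined above: y_n = σ(w^T x_n), the per-sample gradient (y_n - t_n) x_n, and the sequential update w^(τ+1) = w^(τ) - η ∇E_n. The toy data, learning rate, and epoch count are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, eta=0.1, n_epochs=200):
    """Sequential (one sample at a time) gradient descent on the cross-entropy error."""
    N, M = X.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in range(N):
            y_n = sigmoid(w @ X[n])          # y_n = sigma(w^T x_n)
            w -= eta * (y_n - t[n]) * X[n]   # grad of E_n is (y_n - t_n) x_n
    return w

# Toy two-class data (values made up for illustration).
X = np.array([[0.5, 1.0], [1.0, 1.5], [3.0, 0.5], [2.5, 2.0]])
t = np.array([0, 0, 1, 1])
w = fit_logistic(X, t)
print(sigmoid(X @ w))   # fitted posteriors p(C_1 | x_n)
```

Summing the per-sample updates over the whole data set recovers the batch gradient ∇E(w) shown on the slide.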
Multi-class Logistic Regression
• For the case of K > 2 classes:
  p(C_k|x) = p(x|C_k) p(C_k) / Σ_j p(x|C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j)
• Known as the normalized exponential, where a_k = ln[ p(x|C_k) p(C_k) ]
• The normalized exponential is also known as softmax, since if a_k >> a_j for all j ≠ k then p(C_k|x) ≈ 1 and p(C_j|x) ≈ 0
• In logistic regression we assume the activations are given by a_k = w_k^T x

Learning Parameters for Multiclass
• Determining the parameters {w_k} by maximum likelihood requires the derivatives of y_k with respect to all of the activations a_j:
  ∂y_k / ∂a_j = y_k (I_kj - y_j)
  where the I_kj are elements of the identity matrix
• The likelihood function is written using 1-of-K coding
  – The target vector t_n for an x_n belonging to class C_k is a binary vector with all elements zero except for element k, which is one
  p(T|w_1,..,w_K) = Π_{n=1}^N Π_{k=1}^K p(C_k|x_n)^{t_nk} = Π_{n=1}^N Π_{k=1}^K y_nk^{t_nk}
  where T is an N × K matrix of target variables with elements t_nk
• Cross-entropy error function:
  E(w_1,..,w_K) = -ln p(T|w_1,..,w_K) = -Σ_{n=1}^N Σ_{k=1}^K t_nk ln y_nk
• Gradient of the error function:
  ∇_{w_j} E = Σ_{n=1}^N (y_nj - t_nj) x_n
  – which allows a sequential weight-vector update algorithm

Graphical Model for Logistic Regression
• Multiclass logistic regression can be written as
  p(y|x) = (1/Z(x)) exp{ λ_y + Σ_{j=1}^K λ_{y,j} x_j }   where   Z(x) = Σ_y exp{ λ_y + Σ_{j=1}^K λ_{y,j} x_j }
• Rather than using one weight per class, we can define feature functions that are nonzero only for a single class:
  p(y|x) = (1/Z(x)) exp{ Σ_{k=1}^K λ_k f_k(y, x) }
• This notation mirrors the usual notation for CRFs

3. Sequence Models
• Classifiers predict only a single class variable
• Graphical models are best for modeling many variables that are interdependent
• Given a sequence of observations X = {x_n}, n = 1,..,N
• and an underlying sequence of states Y = {y_n}, n = 1,..,N

Sequence Models
• Hidden Markov Model (HMM)   [Figure: chain of states Y with observations X]
  – Independence assumptions: each state depends only on its immediate predecessor; each observation variable depends only on the current state
  – Limitations: strong independence assumptions among the observations X; a large number of parameters introduced by modeling the joint probability p(y,x), which requires modeling the distribution p(x)
• Conditional Random Field (CRF)   [Figure: label chain Y globally conditioned on X]
  – A key advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the observations

Generative Model: HMM
• X is the observed data sequence to be labeled; Y is the random variable over the label sequences
• An HMM is a distribution that models p(Y, X)
• The joint distribution is
  p(Y, X) = Π_{n=1}^N p(y_n|y_{n-1}) p(x_n|y_n)
• The highly structured network indicates conditional independences:
  – past states are independent of future states
  – each observation is conditionally independent of everything else given its state

Discriminative Model for Sequential Data
• A CRF models the conditional distribution p(Y|X)
• A CRF is a random field globally conditioned on the observation X
• The conditional distribution p(Y|X) that follows from the joint distribution p(Y,X) can be rewritten as a Markov random field

Markov Random Field (MRF)
• Also called an undirected graphical model
• The joint distribution of a set of variables x is defined by an undirected graph as
  p(x) = (1/Z) Π_C ψ_C(x_C)
  where C is a maximal clique (each node connected to every other node), x_C is the set of variables in that clique, and ψ_C is a potential function (or local or compatibility function) such that ψ_C(x_C) > 0, typically ψ_C(x_C) = exp{-E(x_C)}
  Z = Σ_x Π_C ψ_C(x_C) is the partition function, which normalizes the distribution
• "Model" refers to a family of distributions and "field" refers to a specific one

MRF with Input-Output Variables
• X is a set of input variables that are observed
  – an element of X is denoted x
• Y is a set of output variables that we predict
  – an element of Y is denoted y
• The A are subsets of X ∪ Y
  – elements of A that are in A ∩ X are denoted x_A
  – elements of A that are in A ∩ Y are denoted y_A
• Then the undirected graphical model has the form
  p(x, y) = (1/Z) Π_A Ψ_A(x_A, y_A)   where   Z = Σ_{x,y} Π_A Ψ_A(x_A, y_A)

MRF Local Function
• Assume each local function has the form
  Ψ_A(x_A, y_A) = exp{ Σ_m θ_Am f_Am(x_A, y_A) }
  where θ_A is a parameter vector, the f_A are feature functions, and m = 1,..,M indexes the features

From HMM to CRF
• In an HMM,
  p(Y, X) = Π_{n=1}^N p(y_n|y_{n-1}) p(x_n|y_n)
• Using the indicator function 1{x = x'}, which takes value 1 when x = x' and 0 otherwise, this can be rewritten as
  p(Y, X) = (1/Z) exp{ Σ_n Σ_{i,j∈S} λ_ij 1{y_n = i} 1{y_{n-1} = j} + Σ_n Σ_{i∈S} Σ_{o∈O} μ_oi 1{y_n = i} 1{x_n = o} }
  with parameters of the distribution θ = {λ_ij, μ_oi}
• Further rewritten using feature functions of the form f_m(y_n, y_{n-1}, x_n):
  p(Y, X) = (1/Z) exp{ Σ_{m=1}^M λ_m Σ_n f_m(y_n, y_{n-1}, x_n) }
• Need one feature for each state-transition pair (i, j) and one for each state-observation pair (i, o)
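A minimal sketch of the rewriting above, assuming toy transition and emission tables: setting λ_ij = ln p(y_n = i | y_{n-1} = j) and μ_oi = ln p(x_n = o | y_n = i) makes the exponential form reproduce the HMM joint exactly, with Z = 1. The initial-state distribution is treated as a transition out of a dummy start state; all numbers are made up for illustration.

```python
import numpy as np

# Toy HMM with states S = {0, 1} and observations O = {0, 1, 2}.
trans = np.array([[0.7, 0.3],        # trans[j, i] = p(y_n = i | y_{n-1} = j)
                  [0.4, 0.6]])
emit  = np.array([[0.5, 0.4, 0.1],   # emit[i, o]  = p(x_n = o | y_n = i)
                  [0.1, 0.3, 0.6]])
start = np.array([0.6, 0.4])         # p(y_1 = i), a transition out of a dummy y_0

# Weights of the equivalent exponential form.
lam  = np.log(trans)                 # lambda_{ij}
mu   = np.log(emit)                  # mu_{oi}, stored as [state i, observation o]
lam0 = np.log(start)                 # weights for the dummy initial transition

def hmm_joint(Y, X):
    """p(Y, X) = prod_n p(y_n | y_{n-1}) p(x_n | y_n)."""
    p = start[Y[0]] * emit[Y[0], X[0]]
    for n in range(1, len(Y)):
        p *= trans[Y[n - 1], Y[n]] * emit[Y[n], X[n]]
    return p

def exponential_form(Y, X):
    """Same joint written as exp{ sum_n lambda_{y_{n-1}, y_n} + mu_{y_n, x_n} }, with Z = 1."""
    score = lam0[Y[0]] + mu[Y[0], X[0]]
    for n in range(1, len(Y)):
        score += lam[Y[n - 1], Y[n]] + mu[Y[n], X[n]]
    return np.exp(score)

Y, X = [0, 1, 1], [2, 0, 1]
print(hmm_joint(Y, X), exponential_form(Y, X))   # the two values agree
```

Conditioning on X and renormalizing the same scoring function over all label sequences yields the linear-chain CRF p(Y|X), where the weights are free parameters rather than log-probabilities.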