Learning Parameters of Undirected Models


Sargur Srihari
[email protected]

Topics
• Difficulties due to Global Normalization
• Likelihood Function
• Maximum Likelihood Parameter Estimation
• Simple and Conjugate Gradient Ascent
• Conditionally Trained Models
• Parameter Learning with Missing Data
  – Gradient Ascent vs. EM
• Alternative Formulation of Max Likelihood
  – Maximum Entropy subject to Constraints
• Parameter Priors and Regularization

Local vs Global Normalization
• BN: local normalization within each CPD:
  $P(X) = \prod_{i=1}^{n} P(X_i \mid pa_i)$
• MN: global normalization through the partition function:
  $P(X_1,\ldots,X_n) = \frac{1}{Z} \prod_{i=1}^{K} \phi_i(D_i)$, where $Z = \sum_{X_1,\ldots,X_n} \prod_{i=1}^{K} \phi_i(D_i)$
• The global factor Z couples all parameters, preventing decomposition
• Significant computational ramifications:
  – Maximum-likelihood parameter estimation has no closed-form solution
  – Iterative methods are needed

Issues in Parameter Estimation
• Simple ML parameter estimation (even with complete data) cannot be solved in closed form
• We need iterative methods such as gradient ascent
• Good news:
  – The likelihood function is concave, so these methods converge to the global optimum
• Bad news:
  – Each step of the iterative algorithm requires inference, so even simple parameter estimation is expensive or intractable
  – Bayesian estimation is practically infeasible

Discriminative Training
• A common use of MNs is in settings such as image segmentation, where we have a particular inference task in mind
• We can train the network discriminatively to get good performance on that particular inference task

Likelihood Function
• The likelihood function is the basis for all discussion of learning
• We examine how it can be optimized to find maximum-likelihood parameter estimates
• We begin with the form of the likelihood function for Markov networks, its properties, and their computational implications

Example of a Coupled Likelihood Function
• Simple Markov network A—B—C with two potentials $\phi_1(A,B)$, $\phi_2(B,C)$
• Gibbs distribution: $P(A,B,C) = \frac{1}{Z}\,\phi_1(A,B)\,\phi_2(B,C)$, with $Z = \sum_{a,b,c} \phi_1(a,b)\,\phi_2(b,c)$
• Log-likelihood of a single instance (a,b,c):
  $\ln P(a,b,c) = \ln\phi_1(a,b) + \ln\phi_2(b,c) - \ln Z$
• Log-likelihood of a data set D with M instances, where the parameter vector θ consists of all entries of the factors $\phi_1$ and $\phi_2$:
  $\ell(\theta : D) = \sum_{m=1}^{M} \big(\ln\phi_1(a[m],b[m]) + \ln\phi_2(b[m],c[m]) - \ln Z(\theta)\big)$  (summing over instances)
  $= \sum_{a,b} M[a,b]\,\ln\phi_1(a,b) + \sum_{b,c} M[b,c]\,\ln\phi_2(b,c) - M\,\ln Z(\theta)$  (summing over the values of the variables, with M[a,b] and M[b,c] the counts in D)
• The first term involves only $\phi_1$ and the second only $\phi_2$, but the third term,
  $\ln Z(\theta) = \ln\Big(\sum_{a,b,c} \phi_1(a,b)\,\phi_2(b,c)\Big)$,
  couples the two potentials in the likelihood function
  – When we change one potential $\phi_1$, $Z(\theta)$ changes, possibly changing the value of $\phi_2$ that maximizes $-\ln Z(\theta)$
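To make the coupling concrete, here is a minimal numerical sketch (not part of the original slides; the factor values and counts below are hypothetical). It builds the A—B—C chain with binary variables, evaluates ℓ(θ : D) from the counts M[a,b] and M[b,c], and shows that rescaling a single entry of φ1 changes ln Z(θ), and with it the term that φ2 has to balance.

```python
import numpy as np

# Hypothetical potentials for the chain A - B - C (all variables binary).
phi1 = np.array([[1.0, 2.0],    # phi1[a, b]
                 [3.0, 1.0]])
phi2 = np.array([[2.0, 1.0],    # phi2[b, c]
                 [1.0, 4.0]])

# Hypothetical sufficient statistics: counts M[a, b] and M[b, c] from M = 100 samples.
M_ab = np.array([[30, 20], [10, 40]])
M_bc = np.array([[25, 15], [20, 40]])
M = M_ab.sum()

def log_Z(phi1, phi2):
    # Z = sum_{a,b,c} phi1(a,b) * phi2(b,c); exact enumeration is cheap for 3 binary variables.
    return np.log(np.einsum('ab,bc->', phi1, phi2))

def log_likelihood(phi1, phi2):
    # l(theta : D) = sum_{a,b} M[a,b] ln phi1(a,b) + sum_{b,c} M[b,c] ln phi2(b,c) - M ln Z(theta)
    return (M_ab * np.log(phi1)).sum() + (M_bc * np.log(phi2)).sum() - M * log_Z(phi1, phi2)

print("ln Z           =", log_Z(phi1, phi2))
print("log-likelihood =", log_likelihood(phi1, phi2))

# Changing phi1 alone changes ln Z(theta), and hence the term that phi2 must balance:
phi1_scaled = phi1.copy()
phi1_scaled[1, 1] *= 5.0
print("ln Z after rescaling one entry of phi1 =", log_Z(phi1_scaled, phi2))
```

Because the same Z(θ) appears in the optimality conditions of both factors, the two potentials cannot be estimated independently; this is exactly the decomposition that is available for Bayesian networks but lost here.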
Illustration of the Coupled Likelihood
• Log-likelihood for A—B—C with respect to the two factors $\phi_1(A,B)$ and $\phi_2(B,C)$
  – With binary variables we would have 8 parameters
  $\ell(\theta : D) = \sum_{a,b} M[a,b]\,\ln\phi_1(a,b) + \sum_{b,c} M[b,c]\,\ln\phi_2(b,c) - M\,\ln Z(\theta)$
• The log-likelihood surface shows that when $\phi_1$ changes, the $\phi_2$ that maximizes $-\ln Z(\theta)$ also changes
  [Figure: log-likelihood surface as a function of the two entries $f_1(a^1,b^1)$ and $f_2(b^0,c^1)$, with all other parameters set to 1; the data set has M = 100 instances with M[a1,b1] = 40 and M[b0,c1] = 40]
• In this case the problem is avoidable
  – The network is equivalent to the BN A→B→C; we can estimate its parameters and set $\phi_1(A,B) = P(A)P(B \mid A)$, $\phi_2(B,C) = P(C \mid B)$
• In general, we cannot convert learned BN parameters into an equivalent MN
  – The optimal likelihood achievable by the two representations is not the same

Form of the Likelihood Function
• Instead of the Gibbs parameterization, use the log-linear framework
  – Joint distribution over n variables $X_1,\ldots,X_n$
  – k features $F = \{ f_i(D_i) \}_{i=1,\ldots,k}$, where each $D_i$ is a sub-graph and $f_i$ maps $D_i$ to $\mathbb{R}$ (k depends on how many values the $D_i$ take)
  $P(X_1,\ldots,X_n;\theta) = \frac{1}{Z(\theta)} \exp\Big\{\sum_{i=1}^{k} \theta_i f_i(D_i)\Big\}$
• The parameters $\theta_i$ are weights we put on the features
  – If we have a sample ξ, its features are $f_i(\xi\langle D_i\rangle)$, written $f_i(\xi)$ for short
• The representation is general: it can capture Markov networks with both global structure and local structure

Parameters θ, Factors φ, Binary Features f
• Variables A—B—C—D—A, with Val(A) = {a0, a1}, Val(B) = {b0, b1}, and similarly for C and D
  [Figure: panels (a)–(c) show the network over A, B, C, D, together with the table of $\phi_1(A,B)$:]

  A    B    φ1(A,B)
  a0   b0   φ1(a0,b0)
  a0   b1   φ1(a0,b1)
  a1   b0   φ1(a1,b0)
  a1   b1   φ1(a1,b1)

• A feature for every entry in every table: the $f_i(D_i)$ are sixteen indicator functions defined over the clusters AB, BC, CD, DA, e.g.
  $f_{a^0,b^0}(A,B) = I\{A = a^0\}\, I\{B = b^0\}$
  i.e. $f_{a^0,b^0} = 1$ if $a = a^0$ and $b = b^0$, and 0 otherwise, etc.
• With this representation the parameters θ are the log-potential entries, used as weights on the features:
  $\theta_{a^0,b^0} = \ln\phi_1(a^0,b^0)$

Log-Likelihood and Sufficient Statistics
• Joint probability distribution:
  $P(X_1,\ldots,X_n;\theta) = \frac{1}{Z(\theta)} \exp\Big\{\sum_{i=1}^{k} \theta_i f_i(D_i)\Big\}$
  where θ = {θ1,...,θk} are the table entries and the $f_i$ are features over instances of the $D_i$
• Let D be a data set of M samples ξ[m], m = 1,...,M
• Log-likelihood (the log of the product of the probabilities of M independent instances):
  $\ell(\theta : D) = \sum_i \theta_i \Big(\sum_m f_i(\xi[m])\Big) - M\,\ln Z(\theta)$
• Sufficient statistics (the likelihood depends on the data only through them): dividing by the number of samples M,
  $\frac{1}{M}\ell(\theta : D) = \sum_i \theta_i\, E_D[f_i(d_i)] - \ln Z(\theta)$
  where $E_D[f_i(d_i)]$ is the average of the feature in the data set

Properties of the Log-Likelihood
• The log-likelihood is a sum of two terms:
  $\ell(\theta : D) = \sum_i \theta_i \Big(\sum_m f_i(\xi[m])\Big) - M\,\ln Z(\theta)$
• The first term is linear in the parameters θ
  – Increasing the parameters increases this term
  – But the likelihood has an upper bound, since probabilities cannot exceed 1
• The second term,
  $\ln Z(\theta) = \ln \sum_{\xi} \exp\Big\{\sum_i \theta_i f_i(\xi)\Big\}$,
  balances the first
  – The log-partition function $\ln Z(\theta)$ is convex (shown next), so its negative is concave
  – The sum of a linear function and a concave function is concave, so there are no local optima and we can use gradient ascent
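The log-linear representation and its sufficient statistics can be sketched in a few lines (an illustration, not from the slides; the diamond network, indicator features, parameters, and random data below are hypothetical). It enumerates the sixteen indicator features over the clusters AB, BC, CD, DA, computes ln Z(θ) by exact enumeration, and evaluates the average log-likelihood (1/M) ℓ(θ : D) = Σ_i θ_i E_D[f_i] − ln Z(θ).

```python
import itertools
import numpy as np

# Clusters of the diamond network A-B-C-D-A; all variables are binary.
clusters = [(0, 1), (1, 2), (2, 3), (3, 0)]   # (A,B), (B,C), (C,D), (D,A)

# One indicator feature per entry of each cluster table: 4 clusters x 4 entries = 16 features.
features = [(cl, vals) for cl in clusters for vals in itertools.product([0, 1], repeat=2)]

def feature_vector(x):
    # f_i(xi) = 1 if the assignment x agrees with the feature's cluster entry, else 0.
    return np.array([1.0 if (x[i], x[j]) == vals else 0.0 for (i, j), vals in features])

all_assignments = list(itertools.product([0, 1], repeat=4))

def log_Z(theta):
    # ln Z(theta) = ln sum_xi exp( sum_i theta_i f_i(xi) ), by exact enumeration (16 assignments).
    scores = np.array([theta @ feature_vector(x) for x in all_assignments])
    return np.log(np.exp(scores).sum())

def avg_log_likelihood(theta, data):
    # (1/M) l(theta : D) = sum_i theta_i E_D[f_i] - ln Z(theta)
    E_D = np.mean([feature_vector(x) for x in data], axis=0)   # sufficient statistics
    return theta @ E_D - log_Z(theta)

# Hypothetical data set and parameters, for illustration only.
rng = np.random.default_rng(0)
data = [tuple(rng.integers(0, 2, size=4)) for _ in range(100)]
theta = rng.normal(scale=0.1, size=len(features))
print("average log-likelihood:", avg_log_likelihood(theta, data))
```

For larger networks the exact enumeration inside log_Z is precisely the inference step that makes every likelihood evaluation expensive.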
Proof that ln Z(θ) is Convex
• A function f(x) is convex if for every 0 ≤ α ≤ 1,
  $f(\alpha \vec{x} + (1-\alpha)\vec{y}) \le \alpha f(\vec{x}) + (1-\alpha) f(\vec{y})$
  – The function is bowl-like
  – Every interpolation between the images of two points lies above the image of their interpolation
• One way to show that a function is convex is to show that its Hessian (the matrix of the function's second derivatives) is positive semi-definite
• The Hessian of ln Z(θ) is computed as follows.

Hessian of ln Z(θ)
• Given a set of features F with
  $\ln Z(\theta) = \ln \sum_{\xi} \exp\Big\{\sum_i \theta_i f_i(\xi)\Big\}$,
  we have $\frac{\partial}{\partial\theta_i} \ln Z(\theta) = E_\theta[f_i]$ and $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j]$,
  where $E_\theta[f_i]$ is shorthand for $E_{P(\chi;\theta)}[f_i]$
• Proof (taking partial derivatives with respect to $\theta_i$ and $\theta_j$):
  $\frac{\partial}{\partial\theta_i} \ln Z(\theta) = \frac{1}{Z(\theta)} \sum_{\xi} \frac{\partial}{\partial\theta_i} \exp\Big\{\sum_j \theta_j f_j(\xi)\Big\} = \frac{1}{Z(\theta)} \sum_{\xi} f_i(\xi)\,\exp\Big\{\sum_j \theta_j f_j(\xi)\Big\} = E_\theta[f_i]$
  $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \ln Z(\theta) = \frac{\partial}{\partial\theta_j}\Big[\frac{1}{Z(\theta)} \sum_{\xi} \frac{\partial}{\partial\theta_i} \exp\Big\{\sum_k \theta_k f_k(\xi)\Big\}\Big] = \mathrm{Cov}_\theta[f_i; f_j]$
• Since the covariance matrix of the features is positive semi-definite, $\ln Z(\theta)$ is convex and $-\ln Z(\theta)$ is a concave function of θ
• Corollary: the log-likelihood function is concave

Non-unique Solution
• Since $\ln Z(\theta)$ is convex, $-\ln Z(\theta)$ is concave
• This implies that the log-likelihood is unimodal: it has no local optima
• However, it does not imply uniqueness of the global optimum
• Multiple parameterizations can result in the same distribution
  – A feature for every entry in a table is always redundant, e.g.
    $f_{a^0,b^0} = 1 - f_{a^0,b^1} - f_{a^1,b^0} - f_{a^1,b^1}$
  – So there is a continuum of parameterizations that achieve the optimum

Maximum (Conditional) Likelihood Parameter Estimation
• Task: estimate the parameters of a Markov network with a fixed structure, given a fully observable data set D
• The simplest variant of the problem is maximum likelihood parameter estimation
• The log-likelihood given features F = { f_i, i = 1,...,k } is
  $\ell(\theta : D) = \sum_{i=1}^{k} \theta_i \Big(\sum_m f_i(\xi[m])\Big) - M\,\ln Z(\theta)$

Gradient of the Log-Likelihood
• Log-likelihood:
  $\ell(\theta : D) = \sum_i \theta_i \Big(\sum_m f_i(\xi[m])\Big) - M\,\ln Z(\theta)$
• Average log-likelihood (dividing by the number of samples):
  $\frac{1}{M}\ell(\theta : D) = \sum_i \theta_i\, E_D[f_i(d_i)] - \ln Z(\theta)$
• The gradient of the second term is
  $\frac{\partial}{\partial\theta_i} \ln Z(\theta) = \frac{1}{Z(\theta)} \sum_{\xi} \frac{\partial}{\partial\theta_i} \exp\Big\{\sum_j \theta_j f_j(\xi)\Big\} = \frac{1}{Z(\theta)} \sum_{\xi} f_i(\xi)\,\exp\Big\{\sum_j \theta_j f_j(\xi)\Big\} = E_\theta[f_i]$
• The gradient of the average log-likelihood is therefore
  $\frac{\partial}{\partial\theta_i} \frac{1}{M}\ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$
  – The first term is the average value of $f_i$ in the data D; the second is its expected value under the model distribution
• This provides a precise characterization of the maximum-likelihood parameters θ
• Theorem: Let F be a feature set. Then $\hat\theta$ is an ML assignment if and only if $E_D[f_i(\chi)] = E_{\hat\theta}[f_i]$ for all i
  – i.e., the expected value of each feature relative to $P_{\hat\theta}$ matches its empirical expectation in D

Need for an Iterative Method
• Although the log-likelihood
  $\ell(\theta : D) = \sum_i \theta_i \Big(\sum_m f_i(\xi[m])\Big) - M\,\ln Z(\theta)$
  is concave, there is no analytical form for its maximum
• Since there is no closed-form solution, we use iterative methods, e.g. gradient ascent
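The moment-matching characterization translates directly into a gradient-ascent sketch (illustrative only; the chain model, random data, step size, and iteration budget below are hypothetical choices, and exact enumeration stands in for the inference a real network would require). It follows the gradient E_D[f_i] − E_θ[f_i] and reports how closely the model expectations match the empirical ones at the end.

```python
import itertools
import numpy as np

# Chain A - B - C with binary variables; one indicator feature per entry of
# phi1(A,B) and phi2(B,C), so theta has 8 components (theta_i = ln of a table entry).
clusters = [(0, 1), (1, 2)]
features = [(cl, vals) for cl in clusters for vals in itertools.product([0, 1], repeat=2)]
assignments = list(itertools.product([0, 1], repeat=3))

def feat(x):
    # Indicator features: f_i(x) = 1 iff x agrees with the feature's cluster entry.
    return np.array([1.0 if (x[i], x[j]) == v else 0.0 for (i, j), v in features])

def model_expectations(theta):
    # E_theta[f_i] under P(x; theta) = exp(theta . f(x)) / Z(theta), by exact enumeration.
    F = np.array([feat(x) for x in assignments])
    scores = F @ theta
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ F

# Hypothetical fully observed data set of M = 200 samples.
rng = np.random.default_rng(1)
data = [tuple(rng.integers(0, 2, size=3)) for _ in range(200)]
E_D = np.mean([feat(x) for x in data], axis=0)   # empirical feature expectations (sufficient statistics)

# Gradient ascent on the average log-likelihood: the gradient is E_D[f_i] - E_theta[f_i].
theta = np.zeros(len(features))
step = 0.2
for _ in range(5000):
    grad = E_D - model_expectations(theta)
    if np.abs(grad).max() < 1e-6:
        break
    theta += step * grad

# At the optimum the model moments (approximately) match the empirical moments: the ML condition.
print("max |E_D[f_i] - E_theta[f_i]| =", np.abs(E_D - model_expectations(theta)).max())
```

Each iteration needs E_θ[f_i], i.e. inference in the current model; here that is a sum over eight joint assignments, but in a realistic Markov network it is the expensive step that makes each iteration costly.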