
Learning in Natural Language

Dan Roth*
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801
[email protected]

Abstract

Statistics-based classifiers in natural language are developed typically by assuming a generative model for the data, estimating its parameters from training data and then using Bayes rule to obtain a classifier. For many problems the assumptions made by the generative models are evidently wrong, leaving open the question of why these approaches work.

This paper presents a learning theory account of the major statistical approaches to learning in natural language. A class of Linear Statistical Queries (LSQ) hypotheses is defined and learning with it is shown to exhibit some robustness properties. Many statistical learners used in natural language, including naive Bayes, Markov Models and Maximum Entropy models, are shown to be LSQ hypotheses, explaining the robustness of these predictors even when the underlying probabilistic assumptions do not hold. This coherent view of when and why learning approaches work in this context may help to develop better learning methods and an understanding of the role of learning in natural language inferences.

1 Introduction

Generative probability models provide a principled way to the study of statistical classification in complex domains such as natural language. It is common to assume a generative model for such data, estimate its parameters from training data and then use Bayes rule to obtain a classifier for this model. In the context of natural language most classifiers are derived from probabilistic language models which estimate the probability of a sentence s, say, using Bayes rule, and then decompose this probability into a product of conditional probabilities according to the generative model assumptions,

Pr(s) = Pr(w_1, w_2, ..., w_n) = ∏_{i=1}^{n} Pr(w_i | w_1, ..., w_{i-1}) = ∏_{i=1}^{n} Pr(w_i | h_i),

where h_i is the relevant history when predicting w_i. Note that in this representation s can be any sequence of tokens, part-of-speech (pos) tags, etc.

This general scheme has been used to derive classifiers for a variety of natural language applications including speech applications [Rabiner, 1989], part of speech tagging [Kupiec, 1992; Schütze, 1995], word-sense disambiguation [Gale et al., 1993], context-sensitive spelling correction [Golding, 1995] and others. While the use of Bayes rule is harmless, most of the work in statistical language modeling and ambiguity resolution is devoted to estimating terms of the form Pr(w|h). The generative models used to estimate these terms typically make Markov or other independence assumptions. It is evident from looking at language data that these assumptions are often patently false and that there are significant global dependencies both within and across sentences. For example, when using a (Hidden) Markov Model (HMM) as a generative model for the problem of part-of-speech tagging, estimating the probability of a sequence of tags involves assuming that the part of speech tag t_i of the word w_i is independent of other words in the sentence, given the preceding tag t_{i-1}. It is not surprising therefore that making these assumptions results in a poor estimate of the probability density function^1. However, classifiers built based on these false assumptions nevertheless seem to behave quite robustly in many cases.

In this paper we develop a learning theory account for this phenomenon. We show that a variety of models used for learning in Natural Language make their prediction using Linear Statistical Queries (LSQ) hypotheses. This is a family of linear predictors over a set of fairly simple features which are directly related to the independence assumptions of the probabilistic model assumed. We claim that the success of classification methods which are derived from incorrect probabilistic density estimation is due to the combination of two factors:
• The low expressive power of the derived classifier.
• Robustness properties shared by all linear statistical queries hypotheses.
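To make the decomposition above concrete, the following minimal sketch (the count tables, start symbol and smoothing constant are illustrative choices, not taken from the paper) estimates Pr(s) = ∏_i Pr(w_i | h_i) when the history h_i is truncated to the previous token, i.e., under a first-order Markov assumption:

```python
import math

def sentence_log_prob(sentence, bigram_counts, unigram_counts, eps=1e-6):
    """Estimate log Pr(s) = sum_i log Pr(w_i | h_i) with h_i taken to be
    the previous token and each conditional probability estimated from
    counts collected on training data."""
    logp, prev = 0.0, "<s>"               # "<s>" marks the sentence start
    for w in sentence:
        num = bigram_counts.get((prev, w), 0) + eps   # crude smoothing
        den = unigram_counts.get(prev, 0) + eps
        logp += math.log(num / den)
        prev = w
    return logp
```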

"Research supported by NSF grants IIS-9801638 and 1 Experimental evidence to that effect will not be included SBR-9873450. in this extended abstract for lack of space (but see Sec. 5).

We define the class of LSQ hypotheses and prove these claims. Namely, we show that since the hypotheses are computed over a feature space chosen so that they perform well on training data, learning theory implies that they perform well on previously unseen data, irrespective of whether the underlying probabilistic assumptions hold. We then show how different models used in the literature can be cast as LSQ hypotheses by selecting the statistical queries (i.e., set of features) appropriately and how this affects the robustness of the derived hypothesis.

The main contribution of the paper is in providing a unified learning theory account for a variety of methods that are widely used in natural language applications. The hope is that providing a coherent explanation for when and why learning approaches work in this context could help in developing better learning methods and a better understanding of the role of learning in natural language inferences.

We first present learning theory preliminaries and then define the class of linear statistical queries hypotheses and discuss their properties. In Sec. 4 we show how to cast known probabilistic predictors in this framework, and in Sec. 5 we discuss the difference between LSQ learning and other learning algorithms.

2 Preliminaries

In an instance of pac learning [Valiant, 1984], a learner needs to determine a close approximation of an unknown target function from labeled examples of that function. The unknown function f is assumed to be an element of a known function class F over the instance space X, and the examples are distributed according to some unknown probability distribution D on X. A learning algorithm draws a sample of labeled examples according to D and eventually outputs a hypothesis h from some hypothesis class H. The error rate of the hypothesis h is defined to be error(h) = Pr_{x∼D}[h(x) ≠ f(x)]. The learner's goal is to output, with probability at least 1 − δ, a hypothesis whose error rate is at most ε, for the given error parameter ε and confidence parameter δ, using a sample size that is polynomial in the relevant parameters. The learner is then called a pac learner. The algorithm is an efficient pac learning algorithm if this is done in time polynomial in the relevant parameters.

A more realistic variant of the pac-learning model, the agnostic learning model [Haussler, 1992; Kearns et al., 1992], applies when we do not want to assume that the labeled training examples (x, l) arise from a target concept of an a-priori simple structure. In this model one assumes that data elements (x, l) are sampled according to some arbitrary distribution D on X × {0, 1}. D may simply reflect the distribution of the data as it occurs "in nature" (including contradictions) without assuming that the labels are generated according to some "rule". The true error of the hypothesis is defined to be err_D(h) = Pr_{(x,l)∼D}[h(x) ≠ l], and the goal of the learner is to compute, with high probability, a hypothesis with true error not larger than ε + inf_{h'∈H} err_D(h'). A learner that can carry out this task for any distribution D is called an efficient (agnostic) pac learner for hypothesis class H if the sample size and the time required to produce the hypothesis h ∈ H are polynomial in the relevant parameters.

In practice, one cannot compute the true error err_D(h), and in many cases it may be hard to guarantee that the computed hypothesis h is close to optimal in H. Instead, the input to the learning algorithm is a sample S = {(x^i, l^i)}_{i=1}^{m} of m labeled examples, and the learner tries to find a hypothesis h with a small empirical error

err_S(h) = |{(x, l) ∈ S : h(x) ≠ l}| / |S|,

and hopes that it behaves well on future examples.

The hope that a classifier learned from a training set will perform well on previously unseen examples is based on the basic theorem of learning theory [Valiant, 1984; Vapnik, 1995] which, stated informally, guarantees that if the training data and the test data are sampled from the same distribution, good performance on a large enough training sample guarantees good performance on the test data (i.e., good "true" error). This is quantified in the following uniform convergence result.

Theorem 2.1 With probability at least 1 − δ, every hypothesis h ∈ H satisfies

err_D(h) < err_S(h) + sqrt( (k/m) ( VC(H) log(2m/VC(H)) + log(4/δ) ) ),

where k is some constant and VC(H) is the VC-dimension of the class H [Vapnik, 1982], a combinatorial parameter which measures the richness of H.

In practice the sample size |S| may be fixed, and the result simply indicates how much one can count on the true accuracy of a hypothesis selected according to its performance on S. Also, the computational problem of choosing a hypothesis which minimizes the empirical error on S (or even approximates a minimum) may be hard [Kearns et al., 1992; Hoffgen and Simon, 1992]. Finally, notice that the main assumption in the theorem above is that the training data and the test data are sampled from the same distribution; Golding and Roth [1999] discuss this assumption in the NLP context.

3 Robust Learning

This section defines a learning algorithm and a class of hypotheses with some generalization properties, that we later show to capture many probabilistic learning methods used in NLP. The learning algorithm discussed here is a Statistical Queries (SQ) algorithm [Kearns, 1993]. An SQ algorithm can be viewed as a learning algorithm that interacts with its environment in a restricted way. Rather than viewing examples, the algorithm only requests the values of various statistics on the distribution of the labeled examples to construct its hypothesis. (E.g., "What is the probability that a randomly chosen labeled example (x, l) has χ(x) = 0 and l = 1?")

A feature χ is an indicator function χ : X → {0, 1} which defines a subset of the instance space - all those elements which are mapped to 1 by χ. 𝒳 denotes a class of such indicator functions. 𝒳 can be viewed as a transformation of the instance space X; each example x ∈ X is mapped to an example (χ_1(x), ..., χ_{|𝒳|}(x)) in the new space. To simplify notation, we sometimes view a feature as an indicator function over the labeled instance space X × {0, 1}, and say that χ(x, l) = 1 for examples x with label l.

A statistical query has the form [χ, l, τ], where χ ∈ 𝒳 is a feature, l is a further (optional) restriction imposed on the query and τ is an error parameter. A call to the SQ oracle returns an estimate \hat{Pr}[χ, l] of

Pr[χ, l] = Pr_{(x,l')∼D}[χ(x) = 1 ∧ l' = l]        (1)

which satisfies |Pr[χ, l] − \hat{Pr}[χ, l]| < τ. (When clear from the context we omit τ from this notation.) A statistical queries algorithm is a learning algorithm that constructs its hypothesis only using information received from an SQ oracle. An algorithm is said to use a query space 𝒳 if it only makes queries of the form [χ, l, τ] where χ ∈ 𝒳. As usual, an SQ algorithm is said to be a good learning algorithm if, with high probability, it outputs a hypothesis with small error, using a sample size that is polynomial in the relevant parameters.

In order to simulate the behavior of the SQ oracle on a query [χ, l, τ] on the distribution D, we simply draw a large sample S of labeled examples according to D and evaluate

\hat{Pr}[χ, l] = |{(x, l') ∈ S : χ(x) = 1 ∧ l' = l}| / |S|.

Chernoff bounds guarantee that the number of examples required to achieve tolerance τ with probability at least 1 − δ is polynomial in 1/τ and log(1/δ).

The SQ model of learning was introduced by Kearns [1993] and studied in [Decatur, 1993; Aslam and Decatur, 1995]. It was viewed as a tool for demonstrating that a pac learning algorithm is noise-tolerant. In particular, it was shown that learning with an SQ algorithm allows the learner to tolerate examples with noisy labels - labels which, with some probability smaller than 1/2, are inconsistent with the target function (classification noise) - as well as various forms of corrupted examples (malicious noise, attribute noise). A typical noise-tolerant learning result in the pac model states that if an SQ algorithm learns a class in the sense that it produces a hypothesis with small true error, then this algorithm is noise tolerant in that it would have produced a hypothesis with small true error even when trained with noisy data. The running time of the algorithm and the number of data samples required depend on the tolerance required from the statistical queries. This, in turn, is determined by the noise level the algorithm is required to tolerate (up to some bounds; see the references for details). In the next section we consider a special SQ algorithm and prove a robustness result that may be more satisfying from a practical point of view.
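The simulation just described amounts to computing empirical frequencies on a sample; the following minimal sketch (function names and the toy data generator are illustrative, not from the paper) shows one simulated SQ-oracle call together with a loose Chernoff-style bound on the sample size it needs:

```python
import math
import random

def estimate_query(sample, chi, label):
    """One simulated SQ-oracle call: the empirical frequency of the event
    {chi(x) = 1 and l = label} over a sample of labeled examples."""
    hits = sum(1 for x, l in sample if chi(x) == 1 and l == label)
    return hits / len(sample)

def sample_size_for(tolerance, delta):
    """A Chernoff-style bound: with this many examples, a single empirical
    frequency is within `tolerance` of its expectation with probability
    at least 1 - delta (constants chosen loosely)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * tolerance ** 2))

if __name__ == "__main__":
    random.seed(0)
    def draw_example():
        x = tuple(random.randint(0, 1) for _ in range(5))
        return x, x[0] & x[1]              # labels from a toy conjunction
    m = sample_size_for(tolerance=0.05, delta=0.01)
    S = [draw_example() for _ in range(m)]
    chi = lambda x: x[3]                   # the feature "the 4th attribute is active"
    print(m, estimate_query(S, chi, label=1))
```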
3.1 Linear Statistical Queries Hypotheses

Let 𝒳 be a class of features and, for each feature χ ∈ 𝒳 and label l, let f_{χ,l} be a function that depends only on the estimates \hat{Pr}[χ', l'] returned by the SQ oracle. A Linear Statistical Queries (LSQ) hypothesis predicts, given x ∈ X,

l = argmax_{l ∈ {0,1}} Σ_{χ ∈ 𝒳} f_{χ,l}({\hat{Pr}[χ', l']}) · χ(x).

Clearly, the LSQ is a linear discriminator over the feature space 𝒳, with coefficients that are computed given (potentially all) the values \hat{Pr}[χ', l']. The definition generalizes naturally to non-binary classifiers by allowing l to range over a larger set of labels; in this case, the discriminator between predicting l and other values is linear. A learning algorithm that outputs an LSQ hypothesis is called an LSQ algorithm.

Example 3.1 The naive Bayes predictor [Duda and Hart, 1973] is derived using the assumption that given the label l the features' values are statistically independent. Consequently, the Bayes optimal prediction is given by:

l = argmax_{l} Pr(l) ∏_{i=1}^{n} Pr(x_i | l),

where Pr(l) denotes the prior probability of l (the fraction of examples labeled l) and Pr(x_i | l) are the conditional feature probabilities (the fraction of the examples labeled l in which the ith feature has value x_i). The LSQ definition implies:

Claim 3.1 The naive Bayes algorithm is an LSQ algorithm over a set 𝒳 which consists of n + 1 features: χ_0 ≡ 1 and χ_i ≡ x_i for i = 1, ..., n, and where f_{χ_0,l} = log Pr(l) and f_{χ_i,l} = log Pr(x_i | l).

We note that formalized this way, as is commonly done in NLP (e.g., [Golding, 1995]), features which are not active in the example are assumed unobserved and are not taken into account in the naive Bayes estimation. It is possible to assume instead that features which are not active are false; this assumption yields an LSQ algorithm over the same features, with different coefficients (see, e.g., [Roth, 1998] for the derivation).
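A minimal sketch of Claim 3.1 in code, under the "active features only" convention just described (the data format, the smoothing of unseen feature-label pairs and the function names are illustrative choices, not from the paper):

```python
import math
from collections import defaultdict

def train_naive_bayes_lsq(examples, labels=(0, 1)):
    """Compute the LSQ coefficients of Claim 3.1 from data: a log prior per
    label plus log P(feature active | label) for every observed
    (feature, label) pair.  Each example is a (set_of_active_features, label)
    pair, and both labels are assumed to appear in the data."""
    n = len(examples)
    label_count = defaultdict(int)
    pair_count = defaultdict(int)
    for feats, l in examples:
        label_count[l] += 1
        for f in feats:
            pair_count[(f, l)] += 1
    coef = {("PRIOR", l): math.log(label_count[l] / n) for l in labels}
    for (f, l), c in pair_count.items():
        coef[(f, l)] = math.log(c / label_count[l])
    return coef

def lsq_predict(coef, active_features, labels=(0, 1)):
    """The prediction is a linear sum of coefficients over active features."""
    def score(l):
        s = coef[("PRIOR", l)]
        for f in active_features:
            # unseen (feature, label) pairs get a crude floor probability
            s += coef.get((f, l), math.log(1e-6))
        return s
    return max(labels, key=score)
```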

The observation that the LSQ hypothesis is linear over 𝒳 yields the first robustness result. VC dimension theory implies (via a variation of, e.g., [Anthony and Holden, 1993], and due to the restriction LSQ imposes on the coefficients) that:

Claim 3.2 The VC dimension of the class of LSQ hypotheses is bounded above by |𝒳|. If all the features used are polynomials of the form x_{i_1} x_{i_2} ··· x_{i_j} (which are all linearly independent), the VC dimension of the LSQ hypotheses is exactly |𝒳|.

This, together with the basic Theorem 2.1, implies:

Corollary 3.3 For any LSQ learning algorithm and given a performance guarantee on previously unseen examples (sampled from the same distribution), the number of training examples required in order to maintain this performance scales linearly with the number of features used.

3.2 Distributional Robustness

LSQ hypotheses are computed by an SQ algorithm; it calls the SQ oracle to estimate \hat{Pr}[χ, l] for all elements in 𝒳 and applies the functions f_{χ,l} to compute the coefficients. Then, given an example, it evaluates the hypothesis to make the prediction. As discussed above, the SQ model was introduced as a tool for demonstrating that a pac learning algorithm is noise-tolerant. In a practical learning situation we are given a fixed sample of data to use in order to generate a hypothesis, which will be tested on a different sample, hopefully "similar" to the training sample. Our goal is to show that an algorithm that is able to learn under these restrictions (namely, interacting only via SQs) is guaranteed to produce a robust hypothesis. Intuitively, since the SQ predictor does not really depend on specific examples, but only on some statistical properties of the data, and only within some tolerance, it should still perform well in the presence of data sampled according to a different distribution, as long as it has the same statistical properties within this tolerance.

Next we formalize this intuition and show that the basic results proved for the SQ model can be adapted to this situation. Formally, we assume that there is a test sample, S_test, generated according to some distribution D', and that we care about our hypothesis' performance on this sample. However, training is performed on a different sample, S_train, generated according to a different distribution D. We show that an SQ hypothesis trained over S_train can still perform well on S_test, if D and D' are not too different. A common distance measure between distributions is the variation distance

d(D, D') = Σ_x |Pr_D(x) − Pr_{D'}(x)|.

It is easy to see that this measure is equivalent to d(D, D') = 2 max_E |Pr_D(E) − Pr_{D'}(E)|, taken over all events E, which is more convenient for our purposes.^2 A more biased measure, strictly smaller than the standard measures, may also be used in this case:

d_𝒳(D, D') = max_{χ ∈ 𝒳, l} |Pr_D[χ, l] − Pr_{D'}[χ, l]|,

yielding better robustness guarantees for the algorithm.

^2 [Yamanishi, 1992] shows that this measure is equivalent to other well known measures such as the KL divergence.

Theorem 3.4 Let A be an SQ learning algorithm with tolerance τ for a function class F over the distribution D, and assume that d(D, D') < γ, with τ − γ inversely polynomial in the relevant parameters. Then A is also an SQ learning algorithm for F over the distribution D'.

Proof: Since A is an SQ algorithm with tolerance τ, a hypothesis generated with estimates that are within τ of the true statistics on D behaves well on D. In order for the hypothesis to behave well on D', we can simply run A with oracle calls of tolerance τ − γ, to get estimates that are within τ − γ of the statistics on D. By the definition of the distance measure, the difference between the probability of events of interest occurring under D and D' is no more than γ, and therefore the same estimates are within τ of the statistics on D'. Thus, this procedure simulates estimates that are within τ on D', implying a well behaved hypothesis on D'. To simulate the more accurate estimates, with tolerance τ − γ, there is a need to use more samples, but only polynomially more due to the relation between τ and γ.

For the final point regarding robustness, observe that the proof shows the importance of the tolerance. In practice, the sample is given, and the robustness of the algorithm to different distributions depends on the sample's size. The richness of the feature class plays an important role here. To ensure tolerance of τ for all features in 𝒳 one needs to use O((1/τ^2) log |𝒳|) samples (more generally, O((1/τ^2) VC(𝒳)) samples). For a given size sample, therefore, the use of simpler features in the LSQ representation provides better robustness. This, in turn, can be traded off with the ability to express the learned function with an LSQ over a simpler set of features.
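The distance restricted to the query space can be checked directly on data; a small sketch of an empirical analogue (assuming samples of (x, l) pairs and indicator features; the names are mine, not the paper's):

```python
def query_space_distance(sample_d, sample_dprime, features, labels=(0, 1)):
    """Empirical analogue of the query-restricted distance above: the largest
    discrepancy, over all (feature, label) queries, between the statistics
    of two samples (e.g., a training sample and a test sample)."""
    def freq(sample, chi, label):
        return sum(1 for x, l in sample if chi(x) and l == label) / len(sample)
    return max(abs(freq(sample_d, chi, l) - freq(sample_dprime, chi, l))
               for chi in features for l in labels)
```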
4 Applications: Probabilistic Classifiers

In this section we cast a few widely used probabilistic classifiers as LSQ hypotheses, implying that they are subject to the properties discussed in Sec. 3.

4.1 Naive Bayes

As shown in Example 3.1, the naive Bayes predictor is an LSQ hypothesis. The implication is that this method, which has been widely used for natural language tasks [Gale et al., 1993; Golding, 1995], satisfies:

Corollary 4.1 The naive Bayes algorithm is a robust learning method in the sense of Cor. 3.3 and Thm 3.4.

4.2 General Bayesian Classifier

The naive Bayes predictor can be generalized in several ways. First, one can consider a generalization of the naive model, in which hidden variables are allowed. The simplest generalization in this direction would assume that the distribution is generated by postulating a "hidden" random variable z. Having randomly chosen a value z for the hidden variable, we choose the value for each observable x_i independently of each other.

As a motivating example consider the case of disambiguation or tagging in natural language tasks, where the "context" or a tag may be viewed as hidden variables. The prediction problem here is to predict the value of one variable, given known values for the remaining variables. One can show that the predictor in this setting is again an LSQ that uses essentially the same features as in the simple (observable) naive Bayes case discussed above (e.g., [Grove and Roth, 1998]).
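The prediction problem just described, predicting one observable given the others under a hidden variable z, reduces to a short sum-of-products once the model's probability tables are available; a minimal sketch (the tables are assumed to have been estimated beforehand, e.g., with EM, and the data layout is an illustrative choice, not from the paper):

```python
def predict_hidden_variable_nb(target_values, observed, p_z, p_obs_given_z):
    """Choose the value v of the target observable maximizing
    sum_z P(z) * P(target = v | z) * prod_i P(x_i = observed_i | z),
    which is proportional to P(target = v | observed) under the
    hidden-variable model of Sec. 4.2."""
    def score(v):
        total = 0.0
        for z, pz in p_z.items():
            prod = pz * p_obs_given_z[z].get(("target", v), 0.0)
            for name, value in observed.items():
                prod *= p_obs_given_z[z].get((name, value), 0.0)
            total += prod
        return total
    return max(target_values, key=score)
```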

A second generalization of naive Bayes is to assume that the independence structure around the predicted variable x is more complicated and is represented, for example, using a Bayesian network that is not as trivial as the naive Bayes network. In this case too, since we care only about predicting a value for a single variable having observed the other variables, it is not hard to verify that the Bayes optimal predictor is an LSQ hypothesis. The features in this case are polynomials x_{i_1} x_{i_2} ··· x_{i_j} of degree that depends on the number of neighbors of x in the network. The details of this analysis are not included for lack of space. Instead, we concentrate on a special case that is of great interest in natural language.

4.3 Markov Models

Markov Models and Hidden Markov Models (HMMs) are a standard model used in language modeling in general and in disambiguation tasks such as part-of-speech (pos) tagging in particular. (See, e.g., [Rabiner, 1989]; we will assume familiarity with them here.) For a bi-gram pos tagger, the states of the HMM are pos tags. Transition probabilities are probabilities of a tag given the previous tag, and emission probabilities are probabilities of a word given a tag. The standard algorithms first estimate the transition and emission probabilities from training data; then they use the Markov model assumption in order to evaluate the probability of a particular part of speech sequence given a sentence. For example, for the part-of-speech sequence DT NN MD VB of the sentence this can will rust one computes

P(DT) P(this|DT) P(NN|DT) P(can|NN) P(MD|NN) P(will|MD) P(VB|MD) P(rust|VB).

Of course, this is correct only under the Markov assumption; namely, one assumes that given a tag for the ith word in the sentence, the value of the word is independent of other words or tags (yielding the term P(can|NN)), and the tag of the i+1st word is independent of preceding tags (yielding P(MD|NN)). In order to tag a sentence for pos one needs to maximize this term over all possible assignments of pos tags. This is usually done using a dynamic programming technique, e.g., a variation of the Viterbi algorithm.

Here we will be interested in the main computational step in this process - predicting the pos of a single word, given the sentence and assuming the pos of all the neighboring words as input. We will disregard the global optimization of finding the best simultaneous pos assignment to all words in the sentence^3. We call this predictor the Markov predictor. For this prediction, it is easy to see that given an example - a word w_i in a sentence, with the tags t_{i-1} and t_{i+1} of its neighbors given - we predict

t_i = argmax_{t ∈ T} P(w_i|t) P(t|t_{i-1}) P(t_{i+1}|t)

or, equivalently,

t_i = argmax_{t ∈ T} [ log P(w_i|t) + log P(t|t_{i-1}) + log P(t_{i+1}|t) ],

where T is the set of all possible pos tags. In the above example, for the word 'can' we will select the tag t that maximizes P(can|t) P(t|DT) P(MD|t).

^3 Interestingly, experimental studies have shown that sometimes the global optimization is not required and the local predictions actually produce better results [Delcher et al., 1993; Roth and Zelenko, 1998].

Let T, W denote the sets of all tags t and all words w, resp., which occur in the examples. In order to represent the Markov predictor as an LSQ algorithm we use 𝒳 = T ∪ (W × T) ∪ (T × T), the set of all singletons and pairs of words and tags. With this set of features and the notation introduced in Eq. 1 it is clear that the Markov predictor above can be written as an LSQ hypothesis:

t_i = argmax_{t ∈ T} Σ_{χ ∈ 𝒳} f_{χ,t} · χ(x),

where, for the active pair features that involve the candidate tag t, the coefficients are the logarithms of the corresponding conditional probabilities, each a ratio of a pair estimate and a singleton estimate,

f_{(w_i,t),t} = log(\hat{Pr}[w_i, t] / \hat{Pr}[t]),  f_{(t_{i-1},t),t} = log(\hat{Pr}[t_{i-1}, t] / \hat{Pr}[t_{i-1}]),  f_{(t,t_{i+1}),t} = log(\hat{Pr}[t, t_{i+1}] / \hat{Pr}[t]),

and 0 otherwise. Notice that in this formalization each word in the sentence yields a single training example and features are computed with respect to the word for which the tag is to be predicted. We get that:

Claim 4.2 The Markov predictor is an LSQ algorithm over a set of singletons and pairs of words and tags. Therefore, the Markov predictor is a robust learning method in the sense of Cor. 3.3 and Thm 3.4. This fact naturally generalizes to higher order Markov predictors by increasing the expressivity of the features.
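A toy sketch of the Markov predictor in this form (the corpus format, the sentence-boundary pseudo-tag and the smoothing constant are illustrative choices, not from the paper):

```python
import math
from collections import defaultdict

def collect_counts(tagged_sentences):
    """Collect the singleton and pair statistics (tags, word-tag pairs and
    tag-tag pairs) that the Markov predictor of Claim 4.2 relies on."""
    tag_c = defaultdict(int)
    word_tag_c = defaultdict(int)
    tag_tag_c = defaultdict(int)
    for sent in tagged_sentences:          # each sent is a list of (word, tag)
        prev = "<s>"                       # sentence-boundary pseudo-tag
        tag_c[prev] += 1
        for word, tag in sent:
            tag_c[tag] += 1
            word_tag_c[(word, tag)] += 1
            tag_tag_c[(prev, tag)] += 1
            prev = tag
    return tag_c, word_tag_c, tag_tag_c

def predict_tag(word, prev_tag, next_tag, counts, tags, eps=1e-6):
    """Local Markov prediction: argmax_t P(word|t) P(t|prev_tag) P(next_tag|t),
    computed as a sum of log-probability coefficients over the active
    word-tag and tag-tag features (the LSQ form)."""
    tag_c, word_tag_c, tag_tag_c = counts
    def logp(pair_count, singleton_count):
        return math.log((pair_count + eps) / (singleton_count + eps))
    def score(t):
        return (logp(word_tag_c[(word, t)], tag_c[t])              # emission
                + logp(tag_tag_c[(prev_tag, t)], tag_c[prev_tag])  # t given prev
                + logp(tag_tag_c[(t, next_tag)], tag_c[t]))        # next given t
    return max(tags, key=score)
```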
4.4 Maximum Entropy Models

Maximum Entropy (ME) models [Jaynes, 1982] are exponential distributions which satisfy given constraints. Constraints are what we have called features, and the distribution is defined in terms of the expected value of the feature over the training set. These models have been used successfully to generate predictors for several disambiguation tasks [Ratnaparkhi et al., 1994; Ratnaparkhi, 1997], typically by estimating the ME probability model for the conditional probability Pr(w|h) (as in Sec. 1).

It can be shown that the distribution that maximizes the entropy and is consistent with a set of constraints (features) must have the form

P(x) = c ∏_{j} α_j^{χ_j(x)},

where c is a normalization constant and the α_j are the model parameters. Each parameter α_j corresponds to exactly one feature χ_j and can be viewed as the weight of that feature.

Predictions using the ME model are performed as follows. Given that the probability distribution P_ME has been estimated as above, and given the observed sentence s, we would predict

l = argmax_{l} P_ME(l|s) = argmax_{l} c ∏_{j} α_j^{χ_j(s,l)}.

Equivalently, by taking logarithms, we get a linear decision surface over the feature space.
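A minimal sketch of this prediction step, with the parameters assumed to have been estimated already (e.g., by generalized iterative scaling [Darroch and Ratcliff, 1972]); the (feature, label) indexing of the weights is an illustrative choice, not from the paper:

```python
import math

def maxent_predict(active_features, alpha, labels):
    """Prediction with an ME model of the form P = c * prod_j alpha_j^chi_j:
    the normalizer c is common to all labels, so taking logarithms leaves a
    linear decision over the feature space.  alpha[(feature, label)] is the
    weight of a feature-label pair; absent pairs contribute nothing."""
    def score(l):
        return sum(math.log(alpha[(f, l)]) for f in active_features
                   if (f, l) in alpha)
    return max(labels, key=score)
```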

However, while the optimal coefficients for an ME predictor are also a function of the features' statistics over the training data, they cannot be written explicitly. Instead, an iterative algorithm (e.g., [Darroch and Ratcliff, 1972]) that approximates the optimal coefficients (but is not known to converge efficiently) is required. The ME predictor is thus a linear predictor but, strictly speaking, in its optimal form it is not an SQ algorithm.

5 LSQ vs. Consistent Learning

In this section we discuss an important difference between an LSQ algorithm and "standard" learning algorithms, using the naive Bayes classifier as a running example. In most cases, algorithms studied in Machine Learning search for a hypothesis that is (almost) consistent with the training data and then rely on Thm 2.1 to guarantee good performance in the future. An LSQ algorithm, on the other hand, need not seek consistency. It simply computes a hypothesis from the responses of the SQ oracle and uses that hypothesis to make future predictions. Indeed, it is easily verified that the hypothesis produced by, say, a naive Bayes algorithm may not be consistent with the data that was used to compute it. Nevertheless, when the data satisfies the probabilistic assumptions used to derive the predictor, namely features' values are independent given the label, this hypothesis is the optimal predictor for future examples (although it may still make many mistakes). However, when the probabilistic assumptions do not hold, the success of the predictor is directly related to its success on the training data.

Consistency (or almost consistency) is not a trivial requirement. In fact, it is hard to come up with structural properties of classes for which naive Bayes is an optimal predictor. Even when the target concept is very simple - the class of Boolean conjunctions - consistency is not achieved in general, and the naive Bayes algorithm is not a good predictor for this class^4.

^4 Domingos & Pazzani claim in a 1998 MLJ paper that naive Bayes is optimal for Boolean conjunctions and disjunctions. Their proof, however, holds only for product distributions, for which the naive Bayes assumptions hold. The following example may be viewed as a counterexample to a more general interpretation of their claim.

Example 5.1 Consider the Boolean conjunction f = x_1 ∧ x_2 over {0, 1}^k. Assume that examples are generated according to the following distribution: Exactly half of the examples are positive. Over the positive examples necessarily x_1 = x_2 = 1; we define x_3 = x_4 = x_5, and this value is 1 with probability 1/2; x_6, ..., x_k are 1 with probability 1/2, independently of anything else. Over the negative examples x_1 is 1 with probability 1/2 and x_2 = 1 − x_1; x_3 = x_4 = x_5, and this value is 1 with probability 7/8; x_6, ..., x_k are 1 with probability 1/2, independently of anything else.

In this way we have that p(x_1 = 1|l = 1) = p(x_2 = 1|l = 1) = 1 and p(x_i = 1|l = 1) = 1/2 for i = 3, ..., k. For the negative examples we have that p(x_i = 1|l = 0) = 7/8 for i = 3, 4, 5 and p(x_i = 1|l = 0) = 1/2 for the remaining variables. It is now easy to see that while the naive Bayes algorithm does not make mistakes on negative examples of a conjunction, it will predict incorrectly on half of the positive examples, all those for which x_3 = x_4 = x_5 = 1. Clearly, this behavior is indicative of the performance of this predictor on any data presented to it which is sampled according to this distribution.
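The behavior described in Example 5.1 is easy to check numerically. A small sketch (the number of variables, the random seed and the sample sizes are arbitrary choices; the conditional probabilities are the ones given in the example):

```python
import random

K = 10                      # number of Boolean attributes x1..xK
random.seed(0)

def draw(positive):
    """Sample one example from the distribution of Example 5.1."""
    x = [0] * (K + 1)                       # 1-based indexing; x[0] unused
    if positive:
        x[1] = x[2] = 1
        v = 1 if random.random() < 0.5 else 0
    else:
        x[1] = 1 if random.random() < 0.5 else 0
        x[2] = 1 - x[1]
        v = 1 if random.random() < 7 / 8 else 0
    x[3] = x[4] = x[5] = v
    for i in range(6, K + 1):
        x[i] = 1 if random.random() < 0.5 else 0
    return x

def nb_predict(x):
    """Naive Bayes with the exact conditional probabilities of the example,
    so estimation noise plays no role."""
    p1 = {1: 1.0, 2: 1.0}
    p0 = {1: 0.5, 2: 0.5}
    for i in range(3, K + 1):
        p1[i] = 0.5
        p0[i] = 7 / 8 if i in (3, 4, 5) else 0.5
    def score(cond):
        s = 0.5                              # equal priors
        for i in range(1, K + 1):
            s *= cond[i] if x[i] == 1 else 1 - cond[i]
        return s
    return 1 if score(p1) > score(p0) else 0

pos_err = sum(nb_predict(draw(True)) != 1 for _ in range(20000)) / 20000
neg_err = sum(nb_predict(draw(False)) != 0 for _ in range(20000)) / 20000
print(pos_err, neg_err)      # roughly 0.5 and 0.0, as the example predicts
```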
Practitioners, however, do not simply compute the LSQ hypothesis and use it, confident that it performs well since (when the assumptions of the underlying model hold) it represents the maximum likelihood hypothesis. Instead, they spend time choosing a feature set with which the predictor performs well on the training set. Once this is achieved, according to the results in Sec. 3, the predictor behaves well on previously unseen examples, but this happens regardless of whether the underlying probabilistic assumptions hold^5. The only limitation on growing the feature set is the relation between the expressiveness of these features and the sample size required to achieve small tolerance, which we have alluded to. The role of the probabilistic assumptions in this case may be viewed as supplying guidance for selecting good features.

^5 Experimental evidence to this effect is presented in [Golding and Roth, 1999]. Two sets of features are compared. The one which provides a better conditional probability estimation, by virtue of better satisfying the independence assumption, nevertheless supports significantly worse performance on the prediction task.

The question of why it is that even in seemingly hard NLP tasks the search for good features is fairly simple, making it unnecessary to resort to very expressive features, is an orthogonal question that we address in a companion paper. For the sake of the argument developed here, it is sufficient to realize that all the algorithms presented rely on their good behavior on the training set, and thus behave well even if the probabilistic assumptions that were used to derive the predictor do not hold.

6 Conclusion

In the last few years we have seen a surge of empirical work in natural language. A significant part of this work is done by using statistical machine learning techniques. Roth [1998] has investigated the relations among some of the commonly used methods and taken preliminary steps towards developing a better theoretical understanding for why and when different methods work.

This work continues to develop the theoretical foundations of learning in natural language, focusing on the study of probabilistic density estimation models and their use in performing natural language predictions.

In addition to providing better learning techniques, developing an understanding for when and why learning works in this context is a necessary step in studying the role of learning in higher-level natural language inferences.

Acknowledgments

I am grateful to Jerry DeJong, Lenny Pitt and anonymous referees for their useful comments on an earlier version of this paper.

References

[Anthony and Holden, 1993] M. Anthony and S. Holden. On the power of polynomial discriminators and radial basis function networks. In Proc. 6th Annu. Workshop on Comput. Learning Theory, pages 158-164. ACM Press, New York, NY, 1993.

[Aslam and Decatur, 1995] J. A. Aslam and S. E. Decatur. Specification and simulation of statistical query algorithms for efficiency and noise tolerance. In Proc. 8th Annu. Conf. on Comput. Learning Theory, pages 437-446. ACM Press, New York, NY, 1995.

[Darroch and Ratcliff, 1972] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43(5):1470-1480, 1972.

[Decatur, 1993] S. E. Decatur. Statistical queries and faulty PAC oracles. In Proceedings of the Sixth Annual ACM Workshop on Computational Learning Theory, pages 262-268. ACM Press, 1993.

[Delcher et al., 1993] A. Delcher, S. Kasif, H. Goldberg, and W. Hsu. Application of probabilistic causal trees to analysis of protein secondary structure. In National Conference on Artificial Intelligence, pages 316-321, 1993.

[Duda and Hart, 1973] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[Gale et al., 1993] W. Gale, K. Church, and D. Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439, 1993.

[Golding and Roth, 1999] A. R. Golding and D. Roth. A winnow based approach to context-sensitive spelling correction. Machine Learning, 1999. Special issue on Machine Learning and Natural Language; preliminary version appeared in ICML-96.

[Golding, 1995] A. R. Golding. A Bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the 3rd Workshop on Very Large Corpora, ACL-95, 1995.

[Grove and Roth, 1998] A. Grove and D. Roth. Linear concepts and hidden variables: An empirical study. In Neural Information Processing Systems. MIT Press, 1998.

[Haussler, 1992] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, September 1992.

[Hoffgen and Simon, 1992] K. Hoffgen and H. Simon. Robust trainability of single neurons. In Proc. 5th Annu. Workshop on Comput. Learning Theory, pages 428-439, New York, NY, 1992. ACM Press.

[Jaynes, 1982] E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939-952, September 1982.

[Kearns et al., 1992] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In Proc. 5th Annu. Workshop on Comput. Learning Theory, pages 341-352. ACM Press, New York, NY, 1992.

[Kearns, 1993] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 392-401, 1993.

[Kupiec, 1992] J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225-242, 1992.

[Rabiner, 1989] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, 1989.

[Ratnaparkhi et al., 1994] A. Ratnaparkhi, J. Reynar, and S. Roukos. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, NJ, March 1994.

[Ratnaparkhi, 1997] A. Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In EMNLP-97, The Second Conference on Empirical Methods in Natural Language Processing, pages 1-10, 1997.

[Roth and Zelenko, 1998] D. Roth and D. Zelenko. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142, 1998.

[Roth, 1998] D. Roth. Learning to resolve natural language ambiguities: A unified approach. In Proc. National Conference on Artificial Intelligence, pages 806-813, 1998.

[Schütze, 1995] H. Schütze. Distributional part-of-speech tagging. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics, 1995.

[Valiant, 1984] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, November 1984.

[Vapnik, 1982] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

[Vapnik, 1995] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[Yamanishi, 1992] K. Yamanishi. A learning criterion for stochastic rules. Machine Learning, 1992.
