Properties of Conditional Markov Chains with Unbounded Feature Functions

Mathieu Sinn
IBM Research - Ireland
Mulhuddart, Dublin 15
[email protected]

Bei Chen
McMaster University
Hamilton, Ontario, Canada
[email protected]

Abstract

Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains were first studied in [1]. The present paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are given.

1 Introduction

Conditional Random Fields (CRF) are a widely popular class of discriminative models for the distribution of a set of hidden states conditional on a set of observable variables. The fundamental assumption is that the hidden states, conditional on the observations, form a Markov random field [2,3]. Of special importance, particularly for the modeling of sequential data, is the case where the underlying undirected graph forms a simple linear chain. In the literature, this subclass of models is often referred to as Linear-Chain Conditional Random Fields. This paper adopts the terminology of [4] and refers to such models as Conditional Markov Chains (CMC).

Large-sample properties of CRFs and CMCs were first studied in [1] and [5]. [1] defines CMCs of infinite length and studies ergodic properties of the joint sequences of observations and hidden states. The analysis relies on fundamental results from the theory of weak ergodicity [6]. The exposition is restricted to CMCs with bounded feature functions, which precludes the application, e.g., to models with linear features and Gaussian observations. [5] considers weak consistency and central limit theorems for models with a more general structure. Ergodicity and mixing of the models is assumed, but no explicit conditions on the feature functions or on the distribution of the observations are given. An analysis of model identifiability in the case of finite sequences can be found in [7].

The present paper studies mixing properties of Conditional Markov Chains with unbounded feature functions. The results are fundamental for analyzing the consistency of Maximum Likelihood estimates and establishing Central Limit Theorems (which are very useful for constructing statistical hypothesis tests, e.g., for model misspecification and the significance of features). The paper is organized as follows: Sec. 2 reviews the definition of infinite CMCs and some of their basic properties. In Sec. 3 the ergodicity results from [1] are extended to models with unbounded feature functions. Sec. 4 establishes various mixing properties. A key result is that, in order to allow for unbounded feature functions, the observations need to follow a distribution such that Hoeffding-type concentration inequalities can be established. Furthermore, the mixing rates depend on the tail behaviour of the distribution. In Sec. 5 the mixing properties are used to analyze model identifiability and consistency of the Maximum Likelihood estimates. Sec. 6 concludes with an outlook on open problems for future research.

2 Conditional Markov Chains

Preliminaries. We use N, Z and R to denote the sets of natural numbers, integers and real numbers, respectively. Let X be a metric space with the Borel sigma-field A, and Y be a finite set. Furthermore, consider a probability space (Ω, F, P) and let X = (X_t)_{t∈Z}, Y = (Y_t)_{t∈Z} be sequences of measurable mappings from Ω into X and Y, respectively. Here,

• X is an infinite sequence of observations ranging in the domain X,
• Y is an aligned sequence of hidden states taking values in the finite set Y.

For now, the distribution of X is arbitrary. Next we define Conditional Markov Chains, which parameterize the conditional distribution of Y given X.

Definition. Consider a vector f of real-valued functions f : X × Y × Y → R, called the feature functions. Throughout this paper, we assume that the following condition is satisfied:

(A1) All feature functions are finite: |f(x, i, j)| < ∞ for all x ∈ X and i, j ∈ Y.

Associated with the feature functions is a vector λ of real-valued model weights. The key object in the definition of Conditional Markov Chains is the matrix M(x) with (i, j)-th component

m(x, i, j) = exp(λ^T f(x, i, j)).

In terms of statistical physics, m(x, i, j) measures the potential of the transition between the hidden states i and j from time t−1 to t, given the observation x at time t. Next, for a sequence x = (x_t)_{t∈Z} in X and time points s, t ∈ Z with s ≤ t, introduce the vectors

α_s^t(x) = M(x_t)^T · · · M(x_s)^T (1, 1, . . . , 1)^T,
β_s^t(x) = M(x_{s+1}) · · · M(x_t) (1, 1, . . . , 1)^T,

and write α_s^t(x, i) and β_s^t(x, j) to denote the i-th respectively j-th components. Intuitively, α_s^t(x, i) measures the potential of the hidden state i at time t given the observations x_s, . . . , x_t and assuming that at time s − 1 all hidden states have potential equal to 1. Similarly, β_s^t(x, j) is the potential of j at time s assuming equal potential of all hidden states at time t. Now let t ∈ Z and k ∈ N, and define the distribution of the labels Y_t, . . . , Y_{t+k} conditional on X,

P(Y_t = y_t, . . . , Y_{t+k} = y_{t+k} | X) := ∏_{i=1}^k m(X_{t+i}, y_{t+i−1}, y_{t+i}) × lim_{n→∞} [α_{t−n}^t(X, y_t) β_t^{t+k+n}(X, y_{t+k})] / [α_{t−n}^t(X)^T β_t^{t+k+n}(X)].

Note that, under assumption (A1), the limit on the right hand side is well-defined (see Theorem 2 in [1]). Furthermore, the family of all marginal distributions obtained this way satisfies the consistency conditions of Kolmogorov's Extension Theorem. Hence we obtain a unique distribution for Y conditional on X, parameterized by the feature functions f and the model weights λ. Intuitively, the distribution is obtained by conditioning the marginal distributions of Y on the finite observational context (X_{t−n}, . . . , X_{t+k+n}), and then letting the size of the context go to infinity.
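The potentials m(x, i, j) and the vectors α_s^t, β_s^t above are straightforward to compute for a finite observation window. The following sketch is our own toy illustration (not from the paper), assuming |Y| = 2 hidden states and a single linear, unbounded feature f(x, i, j) = x·1(i = j) with an arbitrary weight; it approximates the conditional marginal P(Y_t = · | X) by growing the observational context, mirroring the limit in the definition above.

```python
# Toy sketch (our example): potentials m(x,i,j) and the forward/backward
# vectors alpha_s^t, beta_s^t for a finite observation window.
import numpy as np

Y = 2                        # number of hidden states
lam = np.array([0.8])        # model weight vector lambda (assumed)

def f(x, i, j):
    """One linear, unbounded feature: x if the hidden state is kept."""
    return np.array([x if i == j else 0.0])

def M(x):
    """Matrix with entries m(x,i,j) = exp(lambda^T f(x,i,j))."""
    return np.array([[np.exp(lam @ f(x, i, j)) for j in range(Y)]
                     for i in range(Y)])

def alpha(window):
    """alpha_s^t for window = (x_s,...,x_t): M(x_t)^T ... M(x_s)^T 1."""
    v = np.ones(Y)
    for x in window:         # apply M(x_s)^T first, M(x_t)^T last
        v = M(x).T @ v
    return v

def beta(window):
    """beta_s^t for window = (x_{s+1},...,x_t): M(x_{s+1}) ... M(x_t) 1."""
    v = np.ones(Y)
    for x in reversed(window):
        v = M(x) @ v
    return v

# Growing-context approximation of P(Y_t = . | X) around t = 10:
rng = np.random.default_rng(0)
xs = rng.normal(size=21)     # observations x_0,...,x_20
t = 10
for n in (2, 5, 10):
    a = alpha(xs[t - n: t + 1])      # uses x_{t-n},...,x_t
    b = beta(xs[t + 1: t + n + 1])   # uses x_{t+1},...,x_{t+n}
    print(n, a * b / (a @ b))        # normalized marginal for Y_t
```

The printed marginals typically stabilize quickly as n grows, which is the numerical counterpart of the well-definedness of the limit under (A1).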

Basic properties. We introduce the following notation: For any matrix P = (p_ij) with strictly positive entries, let φ(P) denote the mixing coefficient

φ(P) = min_{i,j,k,l} (p_ik p_jl) / (p_jk p_il).

Note that 0 ≤ φ(P) ≤ 1. This coefficient will play a key role in the analysis of mixing properties. The following proposition summarizes fundamental properties of the distribution of Y conditional on X, which directly follow from the above definition (also see Corollary 1 in [1]).

Proposition 1. Suppose that condition (A1) holds true. Then Y conditional on X forms a time-inhomogeneous Markov chain. Moreover, if X is strictly stationary, then the joint distribution of the aligned sequences (X, Y) is strictly stationary. The conditional transition probabilities P_t(x, i, j) := P(Y_t = j | Y_{t−1} = i, X = x) of Y given X = x have the following form:

P_t(x, i, j) = m(x_t, i, j) lim_{n→∞} β_t^n(x, j) / β_{t−1}^n(x, i).

In particular, a lower bound for P_t(x, i, j) is given by

P_t(x, i, j) ≥ [m(x_t, i, j) (min_{k∈Y} m(x_{t+1}, j, k))] / [|Y| (max_{k∈Y} m(x_t, i, k)) (max_{k,l∈Y} m(x_{t+1}, k, l))],

and the matrix of transition probabilities P_t(x), with (i, j)-th component given by P_t(x, i, j), satisfies φ(P_t(x)) = φ(M(x_t)).
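The mixing coefficient φ(P) can be computed by brute force over all index quadruples. The sketch below (our own, assuming a small state space and numpy) also illustrates the two extremes: rank-one potential matrices give φ = 1 (immediate mixing), while strongly diagonal matrices give small φ.

```python
# Brute-force mixing coefficient phi(P) = min_{i,j,k,l} p_ik p_jl / (p_jk p_il)
# for a matrix P with strictly positive entries (our sketch).
import numpy as np

def phi(P):
    n = P.shape[0]
    return min(P[i, k] * P[j, l] / (P[j, k] * P[i, l])
               for i in range(n) for j in range(n)
               for k in range(n) for l in range(n))

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(phi(P))                   # (0.1 * 0.2) / (0.9 * 0.8) ~ 0.0278

u, v = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(phi(np.outer(u, v)))      # rank-one potentials: all ratios cancel, phi = 1
```

In the results of Sec. 4, the products of the factors 1 − φ(M(x_t)) are exactly what drives the geometric decay of conditional covariances.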

3 Ergodicity

In this section we establish conditions under which the aligned sequences (X, Y ) are jointly er- godic. Let us first recall the definition of ergodicity of X (see [8]): By X we denote the space of sequences x = (xt)t∈Z in X , and by A the corresponding product σ-field. Consider the probability measure PX on (X , A) defined by PX (A) := P(X ∈ A) for A ∈ A. Finally, let τ denote the operator on X which shifts sequences one position to the left: τx = (xt+1)t∈Z. Then ergodicity of X is formally defined as follows:

(A2) X is ergodic, that is, P_X(A) = P_X(τ^{−1}A) for every A ∈ A, and P_X(A) ∈ {0, 1} for every set A ∈ A satisfying A = τ^{−1}A.

As a particular consequence of the invariance P_X(A) = P_X(τ^{−1}A), we obtain that X is strictly stationary. Now we are able to formulate the key result of this section, which will be of central importance in the later analysis. For simplicity, we state it for functions depending on the values of X and Y only at time t. The generalization of the statement is straightforward. In our later analysis, we will use the theorem to show that the time average of feature functions f(X_t, Y_{t−1}, Y_t) converges to the expected value E[f(X_t, Y_{t−1}, Y_t)].

Theorem 1. Suppose that conditions (A1) and (A2) hold, and g : X × Y → R is a function which satisfies E[|g(X_t, Y_t)|] < ∞. Then

lim_{n→∞} (1/n) Σ_{t=1}^n g(X_t, Y_t) = E[g(X_t, Y_t)]   P-almost surely.

Proof. Consider the sequence Z = (Z_t)_{t∈N} given by Z_t := (τ^{t−1}X, Y_t), where we write τ^{t−1} to denote the (t − 1)-th iterate of τ. Note that Z_t represents the hidden state at time t together with the entire aligned sequence of observations τ^{t−1}X. In the literature, such models are known as Markov sequences in random environments (see [9]). The key step in the proof is to show that Z is ergodic. Then, for any function h : X × Y → R with E[|h(Z_t)|] < ∞, the time average (1/n) Σ_{t=1}^n h(Z_t) converges to the expected value E[h(Z_t)] P-almost surely. Applying this result to the composition of the function g and the projection of (τ^{t−1}X, Y_t) onto (X_t, Y_t) completes the proof. The details of the proof that Z is ergodic can be found in an extended version of this paper, which is included in the supplementary material.

4 Mixing properties

In this section we are going to study mixing properties of the aligned sequences (X, Y ). To establish the results, we will assume that the distribution of the observations X satisfies conditions under which certain concentration inequalities hold true:

(A3) Let A ∈ A be a measurable set, with p := P(X_t ∈ A) and S_n(x) := (1/n) Σ_{t=1}^n 1(x_t ∈ A) for x ∈ X. Then there exists a constant γ such that, for all n ∈ N and ε > 0,

P(|S_n(X) − p| ≥ ε) ≤ exp(−γ ε² n).

If X is a sequence of independent random variables, then (A3) follows by Hoeffding's inequality. In the dependent case, concentration inequalities of this type can be obtained by imposing Martingale or mixing conditions on X (see [12,13]). Furthermore, we will make the following assumption, which relates the feature functions to the tail behaviour of the distribution of X:

(A4) Let h : [0, ∞) → [0, ∞) be a differentiable decreasing function with h(z) = O(z^{−(1+κ)}) for some κ > 0. Furthermore, let

F(x) := Σ_{j,k∈Y} |λ^T f(x, j, k)|

for x ∈ X. Then E[h(F(X_t))^{−1}] and E[h′(F(X_t))^{−1}] both exist and are finite.

The following theorem establishes conditions under which the expected conditional covariances of square-integrable functions are summable. The result is obtained by studying ergodic properties of the transition probability matrices.

Theorem 2. Suppose that conditions (A1)-(A3) hold true, and g : X × Y → R is a function with finite second moment, E[|g(X_t, Y_t)|²] < ∞. Let

γ_{t,k}(X) = Cov(g(X_t, Y_t), g(X_{t+k}, Y_{t+k}) | X)

denote the covariance of g(X_t, Y_t) and g(X_{t+k}, Y_{t+k}) conditional on X. Then, for every t ∈ Z:

lim_{n→∞} Σ_{k=1}^n E[|γ_{t,k}(X)|] < ∞.
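Before turning to the proof, condition (A3) can be illustrated empirically. The following Monte-Carlo sketch (our own example) treats the i.i.d. Gaussian case, where (A3) follows from Hoeffding's inequality with bound 2·exp(−2ε²n), of which (A3) keeps the exponential part; the empirical exceedance frequency stays well below the bound.

```python
# Monte-Carlo illustration of the concentration condition (A3) for
# i.i.d. N(0,1) observations and the set A = (1, infinity) (our sketch).
import math
import numpy as np

rng = np.random.default_rng(1)
p = 0.5 * math.erfc(1.0 / math.sqrt(2.0))   # p = P(X_t > 1), approx 0.1587
eps, n, reps = 0.05, 500, 2000

# empirical frequencies S_n(X) over many independent sequences
S = np.array([(rng.normal(size=n) > 1.0).mean() for _ in range(reps)])
freq = np.mean(np.abs(S - p) >= eps)

bound = 2.0 * math.exp(-2.0 * eps**2 * n)   # Hoeffding's inequality
print(freq, bound)                           # freq is well below the bound
```

For dependent observations the same experiment applies, but the constant γ then comes from the Martingale or mixing conditions of [12,13] rather than from Hoeffding's inequality.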

Proof. Without loss of generality we may assume that g can be written as g(x, y) = g(x) 1(y = i). Hence, using Hölder's inequality, we obtain

E[|γ_{t,k}(X)|] ≤ E[|g(X_t)|] E[|g(X_{t+k})|] E[|Cov(1(Y_t = i), 1(Y_{t+k} = i) | X)|].

According to the assumptions, we have E[|g(X_t)|] = E[|g(X_{t+k})|] < ∞, so we only need to bound the expectation of the conditional covariance. Note that

Cov(1(Y_t = i), 1(Y_{t+k} = i) | X) = P(Y_t = i, Y_{t+k} = i | X) − P(Y_t = i | X) P(Y_{t+k} = i | X).

Recall the definition of φ(P) before Proposition 1. Using probabilistic arguments, it is not difficult to show that the ratio of P(Y_t = i, Y_{t+k} = i | X) to P(Y_t = i | X) P(Y_{t+k} = i | X) is greater than or equal to φ(P_{t+1}(X) · · · P_{t+k}(X)), where P_{t+1}(X), . . . , P_{t+k}(X) denote the transition matrices introduced in Proposition 1. Hence,

|Cov(1(Y_t = i), 1(Y_{t+k} = i) | X)| ≤ P(Y_t = i, Y_{t+k} = i | X) [1 − φ(P_{t+1}(X) · · · P_{t+k}(X))].

Now, using results from the theory of weak ergodicity (see Chapter 3 in [6]), we can establish

[1 − √φ(P_{t+1}(x) · · · P_{t+k}(x))] / [1 + √φ(P_{t+1}(x) · · · P_{t+k}(x))] ≤ ∏_{j=1}^k [1 − √φ(P_{t+j}(x))] / [1 + √φ(P_{t+j}(x))]

for all x ∈ X. Using Bernoulli's inequality and the fact φ(P_{t+j}(x)) = φ(M(x_{t+j})) established in Proposition 1, we obtain

φ(P_{t+1}(x) · · · P_{t+k}(x)) ≥ 1 − 4 ∏_{j=1}^k [1 − φ(M(x_{t+j}))].

Consequently,

|Cov(1(Y_t = i), 1(Y_{t+k} = i) | X)| ≤ 4 ∏_{j=1}^k [1 − φ(M(X_{t+j}))].

With the notation introduced in assumption (A3), let δ > 0 and A ∈ A with p > 0 be such that x ∈ A implies φ(M(x)) ≥ δ. Furthermore, let ε be a constant with 0 < ε < p. In order to bound |Cov(1(Y_t = i), 1(Y_{t+k} = i) | X)| for a given k ∈ N, we distinguish two cases: If |S_k(X) − p| < ε, then we obtain

4 ∏_{j=1}^k [1 − φ(M(X_{t+j}))] ≤ 4 (1 − δ)^{k(p−ε)}.

If |S_k(X) − p| ≥ ε, then we use the trivial upper bound 1. According to assumption (A3), the probability of the latter event is bounded by an exponential, and hence

E[|Cov(1(Y_t = i), 1(Y_{t+k} = i) | X)|] ≤ 4 (1 − δ)^{k(p−ε)} + exp(−γ ε² k).

Obviously, the sum of all these expectations is finite, which completes the proof.

The next theorem bounds the difference between the distribution of Y conditional on X and finite approximations of it. Introduce the following notation: For t, k ≥ 0 with t + k ≤ n let

P^{(0:n)}(Y_t = y_t, . . . , Y_{t+k} = y_{t+k} | X = x) := ∏_{i=1}^k m(x_{t+i}, y_{t+i−1}, y_{t+i}) × [α_0^t(x, y_t) β_{t+k}^n(x, y_{t+k})] / [α_0^t(x)^T β_t^n(x)].

Accordingly, write E^{(0:n)} for expectations taken with respect to P^{(0:n)}. As emphasized by the superscripts, these quantities can be regarded as marginal distributions of Y conditional on the observations at times t = 0, 1, . . . , n. To simplify notation, the following theorem is stated for 1-dimensional conditional marginal distributions; however, the extension to the general case is straightforward.

Theorem 3. Suppose that conditions (A1)-(A4) hold true. Then the limit

P^{(0:∞)}(Y_t = i | X) := lim_{n→∞} P^{(0:n)}(Y_t = i | X)

is well-defined P-almost surely. Moreover, there exists a measurable function C(x) of x ∈ X with finite expectation, E[|C(X)|] < ∞, and a function h(z) satisfying the conditions in (A4), such that

|P^{(0:∞)}(Y_t = i | X) − P^{(0:n)}(Y_t = i | X)| ≤ C(τ^t X) h(n − t).

Proof. Define G_n(x) := M(x_{t+1}) · · · M(x_n) and write g_n(x, i, j) for the (i, j)-th component of G_n(x). Note that β_t^n(x) = G_n(x) (1, 1, . . . , 1)^T. According to Lemma 3.4 in [6], there exist numbers r_ij(x) such that

lim_{n→∞} g_n(x, i, k) / g_n(x, j, k) = r_ij(x)

for all k ∈ Y. Consequently, the ratio of β_t^n(x, i) to β_t^n(x, j) converges to r_ij(x), and hence

lim_{n→∞} [α_0^t(x, i) β_t^n(x, i)] / [α_0^t(x)^T β_t^n(x)] = 1 / (q_i(x)^T r_i(x)),

where we use the notation q_i(x) = α_0^t(x) / α_0^t(x, i), and r_i(x) denotes the vector with k-th component r_ki(x). This proves the first part of the theorem. In order to prove the second part, note that |x − y| ≤ |x^{−1} − y^{−1}| for any x, y ∈ (0, 1], and hence

|P^{(0:∞)}(Y_t = i | X) − P^{(0:n)}(Y_t = i | X)| ≤ |q_i(X)^T r_i(X) − [α_0^t(X)^T β_t^n(X)] / [α_0^t(X, i) β_t^n(X, i)]|.

To bound the latter expression, introduce the vectors r̲_i^n(x) and r̄_i^n(x) with k-th components

r̲_ki^n(x) = min_{l∈Y} [g_n(x, k, l) / g_n(x, i, l)]   and   r̄_ki^n(x) = max_{l∈Y} [g_n(x, k, l) / g_n(x, i, l)].

It is easy to see that q_i(x)^T r̲_i^n(x) ≤ q_i(x)^T r_i(x) ≤ q_i(x)^T r̄_i^n(x), and

q_i(x)^T r̲_i^n(x) ≤ [α_0^t(x)^T β_t^n(x)] / [α_0^t(x, i) β_t^n(x, i)] ≤ q_i(x)^T r̄_i^n(x).

Hence,

|q_i(X)^T r_i(X) − [α_0^t(X)^T β_t^n(X)] / [α_0^t(X, i) β_t^n(X, i)]| ≤ q_i(X)^T (r̄_i^n(X) − r̲_i^n(X)).

Due to space limitations, we only give a sketch of the proof how the latter quantity can be bounded. For details, see the extended version of this paper in the supplementary material. The first step is to show the existence of a function C_1(x) with E[|C_1(X)|] < ∞ such that |r̄_ki^n(X) − r̲_ki^n(X)| ≤ C_1(τ^t X) (1 − ζ)^{n−t} for some ζ > 0. With the function F(x) introduced in assumption (A4), we define C_2(x) := exp(F(x)) for x ∈ X and arrive at

|P^{(0:∞)}(Y_t = i | X) − P^{(0:n)}(Y_t = i | X)| ≤ |Y|² C_1(τ^t X) C_2(X_t) (1 − ζ)^{n−t}.

The next step is to construct a function C_3(x) satisfying the following two conditions: (i) If C_2(x) (1 − ζ)^k ≥ 1, then C_3(x) h(k) ≥ 1. (ii) If C_2(x) (1 − ζ)^k < 1, then C_3(x) h(k) ≥ C_2(x) (1 − ζ)^k. Since the difference between two probabilities cannot exceed 1, we obtain

|P^{(0:∞)}(Y_t = i | X) − P^{(0:n)}(Y_t = i | X)| ≤ |Y|² C_1(τ^t X) C_3(X_t) h(n − t).

The last step is to show that E[|C_3(X_t)|] < ∞.
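The approximation behaviour in Theorem 3 can be observed numerically. Reusing a toy model (our own: two hidden states, one linear feature f(x, i, j) = x·1(i = j) with weight 0.8), the finite-window marginals P^{(0:n)}(Y_t = i | X), computed via α_0^t and β_t^n, approach a limit as n grows:

```python
# Finite-window marginals P^(0:n)(Y_t = i | x_0,...,x_n) for a toy CMC
# (our example), illustrating the approximation bound of Theorem 3.
import numpy as np

lam = 0.8

def M(x):
    """m(x,i,j) = exp(lam * x * 1(i == j)) for two hidden states."""
    return np.array([[np.exp(lam * x), 1.0],
                     [1.0, np.exp(lam * x)]])

def marginal(xs, t):
    """P^(0:n)(Y_t = . | xs) with n = len(xs) - 1, via alpha_0^t, beta_t^n."""
    a = np.ones(2)
    for x in xs[:t + 1]:            # alpha_0^t uses x_0,...,x_t
        a = M(x).T @ a
    b = np.ones(2)
    for x in reversed(xs[t + 1:]):  # beta_t^n uses x_{t+1},...,x_n
        b = M(x) @ b
    return a * b / (a @ b)

rng = np.random.default_rng(2)
xs = rng.normal(size=40)
t = 5
ref = marginal(xs, t)               # long window as a proxy for n -> infinity
for n in (6, 10, 15, 20):
    print(n, abs(marginal(xs[:n + 1], t)[0] - ref[0]))
```

The printed errors decay roughly geometrically in n − t, consistent with the factor (1 − ζ)^{n−t} in the proof; for heavier-tailed observations the modulation by h(n − t) becomes visible.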

The following result will play a key role in the later analysis of empirical likelihood functions.

Theorem 4. Suppose that conditions (A1)-(A4) hold, and the function g : X × Y → R satisfies E[|g(X_t, Y_t)|] < ∞. Then

lim_{n→∞} (1/n) Σ_{t=1}^n E^{(0:n)}[g(X_t, Y_t) | X] = E[g(X_t, Y_t)]   P-almost surely.

Proof. Without loss of generality we may assume that g can be written as g(x, y) = g(x) 1(y = i). Using the result from Theorem 3, we obtain

|Σ_{t=1}^n E^{(0:n)}[g(X_t, Y_t) | X] − Σ_{t=1}^n E^{(0:∞)}[g(X_t, Y_t) | X]| ≤ Σ_{t=1}^n |g(X_t)| |C(τ^t X)| h(n − t),

where h(z) is a function satisfying the conditions in assumption (A4). See the extended version of this paper in the supplementary material for more details. Using the facts that X is ergodic and the expectations of |g(X_t)| and |C(τ^t X)| are finite, we obtain

lim_{n→∞} (1/n) |Σ_{t=1}^n E^{(0:n)}[g(X_t, Y_t) | X] − Σ_{t=1}^n E^{(0:∞)}[g(X_t, Y_t) | X]| = 0.

By similar arguments to the proof of the first part of Theorem 3, one can show that the difference |E^{(0:∞)}[g(X_t, Y_t) | X] − E[g(X_t, Y_t) | X]| tends to 0 as t → ∞. Thus,

lim_{n→∞} (1/n) |Σ_{t=1}^n E^{(0:∞)}[g(X_t, Y_t) | X] − Σ_{t=1}^n E[g(X_t, Y_t) | X]| = 0.

Now, noting that E[g(X_t, Y_t) | X] = E[g(X_0, Y_0) | τ^t X] for every t, the theorem follows by the ergodicity of X.

5 Maximum Likelihood learning and model identifiability

In this section we apply the previous results to analyze asymptotic properties of the empirical likelihood function. The setting is the following: Suppose that we observe finite subsequences X_n = (X_0, . . . , X_n) and Y_n = (Y_0, . . . , Y_n) of X and Y, where the distribution of Y conditional on X follows a Conditional Markov Chain with fixed feature functions f and unknown model weights λ*. We assume that λ* lies in some parameter space Θ, the structure of which will become important later. To emphasize the role of the model weights, we will use subscripts, e.g., P_λ and E_λ, to denote the conditional distributions. Our goal is to identify the unknown model weights from the finite samples, X_n and Y_n. In order to do so, introduce the shorthand notation f(x_n, y_n) = Σ_{t=1}^n f(x_t, y_{t−1}, y_t) for x_n = (x_0, . . . , x_n) and y_n = (y_0, . . . , y_n). Consider the normalized conditional likelihood,

L_n(λ) = (1/n) [λ^T f(X_n, Y_n) − log Σ_{y_n ∈ Y^{n+1}} exp(λ^T f(X_n, y_n))].

Note that, in the context of finite Conditional Markov Chains, L_n(λ) is the exact likelihood of Y_n conditional on X_n. The Maximum Likelihood estimate of λ* is given by

λ̂_n := arg max_{λ∈Θ} L_n(λ).

If L_n(λ) is strictly concave, then the arg max is unique and can be found using gradient-based search (see [14]). It is easy to see that L_n(λ) is strictly concave if and only if |Y| > 1 and there exists a sequence y_n such that at least one component of f(X_n, y_n) is non-zero. In the following, we study strong consistency of the Maximum Likelihood estimates, a property which is of central importance in large sample theory (see [15]). As we will see, a key problem is the identifiability and uniqueness of the model weights.

5.1 Asymptotic properties of the likelihood function

In addition to the conditions (A1)-(A4) stated earlier, we will make the following assumptions:

(A5) The feature functions have finite second moments: E_{λ*}[|f(X_t, Y_{t−1}, Y_t)|²] < ∞.

(A6) The parameter space Θ is compact.

The next theorem establishes asymptotic properties of the likelihood function L_n(λ).

Theorem 5. Suppose that conditions (A1)-(A6) are satisfied. Then the following holds true:

(i) There exists a function L(λ) such that lim_{n→∞} L_n(λ) = L(λ) P_{λ*}-almost surely for every λ ∈ Θ. Moreover, the convergence of L_n(λ) to L(λ) is uniform on Θ, that is,

lim_{n→∞} sup_{λ∈Θ} |L_n(λ) − L(λ)| = 0   P_{λ*}-almost surely.

(ii) The gradients satisfy

lim_{n→∞} ∇L_n(λ) = E_{λ*}[f(X_t, Y_{t−1}, Y_t)] − E_λ[f(X_t, Y_{t−1}, Y_t)]   P_{λ*}-almost surely

for every λ ∈ Θ.

(iii) If the limit of the Hessian ∇²L_n(λ) is finite and non-singular, then the function L(λ) is strictly concave on Θ. As a consequence, the Maximum Likelihood estimates are strongly consistent:

lim_{n→∞} λ̂_n = λ*   P_{λ*}-almost surely.

Proof. The statements are obtained analogously to Lemma 4-6 and Theorem 4 in [1], using the auxiliary results for Conditional Markov Chains with unbounded feature functions established in Theorem 1, Theorem 2, and Theorem 4.
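For finite chains, the quantities appearing in Theorem 5 can be computed exactly. The sketch below (our own toy example: two hidden states, one linear feature) evaluates L_n(λ) with the log-partition function obtained by a forward recursion, cross-checks it against brute-force enumeration of all label sequences, and verifies concavity in λ by a numerical second difference:

```python
# Exact normalized conditional likelihood L_n(lambda) for a short chain
# (our toy example), with the partition function from a forward recursion.
import itertools
import numpy as np

def f(x, i, j):
    """One linear feature: x if the hidden state is kept (assumed model)."""
    return x * (i == j)

def log_Z(lam, xs):
    """log of the sum over all label sequences y_0..y_n of exp(lam * sum_t f)."""
    z = np.ones(2)                  # one entry per value of y_0
    for x in xs[1:]:
        z = np.array([sum(z[i] * np.exp(lam * f(x, i, j)) for i in range(2))
                      for j in range(2)])
    return np.log(z.sum())

def L_n(lam, xs, ys):
    n = len(xs) - 1
    score = sum(lam * f(xs[t], ys[t - 1], ys[t]) for t in range(1, n + 1))
    return (score - log_Z(lam, xs)) / n

xs = [0.3, -1.2, 0.7, 0.4]
ys = [0, 1, 1, 0]
lam = 0.5

# brute-force partition function for the tiny chain above
Z_brute = sum(np.exp(sum(lam * f(xs[t], y[t - 1], y[t]) for t in range(1, 4)))
              for y in itertools.product(range(2), repeat=4))
print(np.exp(log_Z(lam, xs)), Z_brute)      # the two agree

# numerical second difference: L_n is concave in lambda
h = 1e-2
d2 = L_n(lam + h, xs, ys) + L_n(lam - h, xs, ys) - 2 * L_n(lam, xs, ys)
print(d2)                                    # negative
```

The negative second difference reflects that the Hessian of L_n is (minus) a feature covariance, which is the structure exploited in Theorem 6 below.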

Next, we study the asymptotic behaviour of the Hessian ∇²L_n(λ). In order to compute the derivatives, introduce the vectors λ_1, . . . , λ_n with λ_t = λ for t = 1, . . . , n. This allows us to write λ^T f(X_n, Y_n) = Σ_{t=1}^n λ_t^T f(X_t, Y_{t−1}, Y_t). Now, regard the argument λ of the likelihood function as a stacked vector (λ_1, . . . , λ_n). As in [1], this gives us the expressions

∂²/(∂λ_t ∂λ_{t+k}^T) L_n(λ) = (1/n) Cov_λ^{(0:n)}(f(X_t, Y_{t−1}, Y_t), f(X_{t+k}, Y_{t+k−1}, Y_{t+k})^T | X),

where Cov_λ^{(0:n)} is the covariance with respect to the measure P_λ^{(0:n)} introduced before Theorem 3. Using these expressions, the Hessian of L_n(λ) can be written as

∇²L_n(λ) = −( Σ_{t=1}^n ∂²/(∂λ_t ∂λ_t^T) L_n(λ) + 2 Σ_{k=1}^{n−1} Σ_{t=1}^{n−k} ∂²/(∂λ_t ∂λ_{t+k}^T) L_n(λ) ).

The following theorem establishes an expression for the limit of ∇²L_n(λ). It differs from the expression given in Lemma 7 of [1], which is incorrect.

Theorem 6. Suppose that conditions (A1)-(A5) hold. Then

lim_{n→∞} ∇²L_n(λ) = −( γ_λ(0) + 2 Σ_{k=1}^∞ γ_λ(k) )   P_{λ*}-almost surely,

where γ_λ(k) = E[Cov_λ(f(X_t, Y_{t−1}, Y_t), f(X_{t+k}, Y_{t+k−1}, Y_{t+k}) | X)] is the expectation of the conditional covariance (with respect to P_λ) between f(X_t, Y_{t−1}, Y_t) and f(X_{t+k}, Y_{t+k−1}, Y_{t+k}) given X. In particular, the limit of ∇²L_n(λ) is finite.

Proof. The key step is to show the existence of a positive measurable function Uλ(k, x) such that

lim_{n→∞} Σ_{k=1}^{n−1} |Σ_{t=1}^{n−k} ∂²/(∂λ_t ∂λ_{t+k}^T) L_n(λ)| ≤ lim_{n→∞} Σ_{k=1}^{n−1} E[U_λ(k, X)],

with the limit on the right hand side being finite. Then the rest of the proof is straightforward: Theorem 4 shows that, for fixed k ≥ 0,

lim_{n→∞} Σ_{t=1}^{n−k} ∂²/(∂λ_t ∂λ_{t+k}^T) L_n(λ) = γ_λ(k)   P_{λ*}-almost surely.

Hence, in order to establish the theorem, it suffices to show that

lim_{n→∞} |Σ_{k=1}^{n−1} γ_λ(k) − Σ_{k=1}^{n−1} Σ_{t=1}^{n−k} ∂²/(∂λ_t ∂λ_{t+k}^T) L_n(λ)| ≤ ε

for all ε > 0. Now let ε > 0 be fixed. According to Theorem 2 we have Σ_{k=1}^∞ |γ_λ(k)| < ∞. Hence we can find a finite N such that

lim_{n→∞} Σ_{k=N}^{n−1} |γ_λ(k)| + lim_{n→∞} Σ_{k=N}^{n−1} E[U_λ(k, X)] ≤ ε.

On the other hand, Theorem 4 shows that

lim_{n→∞} |Σ_{k=1}^{N−1} (γ_λ(k) − Σ_{t=1}^{n−k} ∂²/(∂λ_t ∂λ_{t+k}^T) L_n(λ))| = 0.

For details on how to construct Uλ(k, x), see the extended version of this paper.

5.2 Model identifiability

Let us conclude the analysis by investigating conditions under which the limit of the Hessian ∇²L_n(λ) is non-singular. Note that ∇²L_n(λ) is negative semi-definite for every n, so the limit is negative semi-definite as well, but not necessarily negative definite. Using the result in Theorem 6, we can establish the following statement:

Corollary 1. Suppose that assumptions (A1)-(A5) hold true. Then the following conditions are necessary for the limit of ∇²L_n(λ) to be non-singular:

(i) For each feature function f(x, i, j), there exists a set A ⊂ X with P(X_t ∈ A) > 0 such that, for every x ∈ A, we can find i, j, k, l ∈ Y with f(x, i, j) ≠ f(x, k, l).

(ii) There does not exist a non-zero vector λ and a subset A ⊂ X with P(X_t ∈ A) = 1 such that λ^T f(x, i, j) is constant for all x ∈ A and i, j ∈ Y.

Condition (i) essentially says: features f(x, i, j) must not be constant in i and j. Condition (ii) says that features must not be expressible as linear combinations of each other. Clearly, features violating condition (i) can be assigned arbitrary model weights without any effect on the conditional distributions. If condition (ii) is violated, then there are infinitely many ways of parameterizing the same model. In practice, some authors have found positive effects of including redundant features (see, e.g., [16]), however, usually in the context of a learning objective with an additional penalizer.
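Condition (ii) can be checked numerically for a given feature set: a non-zero λ making λ^T f(x, i, j) constant exists exactly when the mean-centered feature vectors, stacked over sampled observations x and all state pairs (i, j), are rank-deficient. A sketch (our own, with made-up feature data) using the SVD:

```python
# Rank check for identifiability condition (ii) (our sketch): directions
# lambda with lambda^T f constant correspond to zero singular values of
# the mean-centered matrix of stacked feature vectors.
import numpy as np

def redundant_directions(feat_rows, tol=1e-10):
    """feat_rows: (N, d) array, one row per sampled f(x, i, j).
    Returns the directions lambda along which lambda^T f is constant."""
    centered = feat_rows - feat_rows.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[s < tol]

# Example: the third feature is a linear combination of the first two
# plus a constant, so condition (ii) is violated.
rng = np.random.default_rng(3)
base = rng.normal(size=(100, 2))
F = np.column_stack([base, base[:, 0] + base[:, 1] + 2.0])
bad = redundant_directions(F)
print(bad.shape[0])             # 1 redundant direction, proportional to (1, 1, -1)
```

A non-empty result means the model weights are determined only up to translations along the returned directions, matching the observation above that a violated condition (ii) yields infinitely many parameterizations of the same model.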

6 Conclusions

We have established ergodicity and various mixing properties of Conditional Markov Chains with unbounded feature functions. The main insight is that results similar to those in the setting with bounded feature functions can be obtained, however, under additional assumptions on the distribution of the observations. In particular, the proof of Theorem 2 has shown that the sequence of observations needs to satisfy conditions under which Hoeffding-type concentration inequalities can be obtained. The proof of Theorem 3 has revealed an interesting interplay between mixing rates, feature functions, and the tail behaviour of the distribution of observations. By applying the mixing properties to the empirical likelihood functions, we have obtained necessary conditions for the Maximum Likelihood estimates to be strongly consistent. We see a couple of interesting problems for future research: establishing Central Limit Theorems for Conditional Markov Chains; deriving bounds for the asymptotic variance of Maximum Likelihood estimates; constructing tests for the significance of features; generalizing the estimation theory to an infinite number of features; and, finally, finding sufficient conditions for model identifiability.

References

[1] Sinn, M. & Poupart, P. (2011) Asymptotic theory for linear-chain conditional random fields. In Proc. of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
[2] Lafferty, J., McCallum, A. & Pereira, F. (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning (ICML).
[3] Sutton, C. & McCallum, A. (2006) An introduction to conditional random fields for relational learning. In: Getoor, L. & Taskar, B. (editors), Introduction to Statistical Relational Learning. Cambridge, MA: MIT Press.
[4] Hofmann, T., Schölkopf, B. & Smola, A.J. (2008) Kernel methods in machine learning. The Annals of Statistics, Vol. 36, No. 3, 1171-1220.
[5] Xiang, R. & Neville, J. (2011) Relational learning with one network: an asymptotic analysis. In Proc. of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
[6] Seneta, E. (2006) Non-Negative Matrices and Markov Chains. Revised Edition. New York, NY: Springer.
[7] Wainwright, M.J. & Jordan, M.I. (2008) Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, Vol. 1, Nos. 1-2, 1-305.
[8] Cornfeld, I.P., Fomin, S.V. & Sinai, Y.G. (1982) Ergodic Theory. Berlin, Germany: Springer.
[9] Orey, S. (1991) Markov chains with stochastically stationary transition probabilities. The Annals of Probability, Vol. 19, No. 3, 907-928.
[10] Hernández-Lerma, O. & Lasserre, J.B. (2003) Markov Chains and Invariant Probabilities. Basel, Switzerland: Birkhäuser.
[11] Foguel, S.R. (1969) The Ergodic Theory of Markov Processes. Princeton, NJ: Van Nostrand.
[12] Samson, P.-M. (2000) Concentration of measure inequalities for Markov chains and Φ-mixing processes. The Annals of Probability, Vol. 28, No. 1, 416-461.
[13] Kontorovich, L. & Ramanan, K. (2008) Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, Vol. 36, No. 6, 2126-2158.
[14] Sha, F. & Pereira, F. (2003) Shallow parsing with conditional random fields. In Proc. of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL).
[15] Lehmann, E.L. (1999) Elements of Large-Sample Theory. New York, NY: Springer.
[16] Hoefel, G. & Elkan, C. (2008) Learning a two-stage SVM/CRF sequence classifier. In Proc. of the 17th ACM International Conference on Information and Knowledge Management (CIKM).