A New Perspective of Statistical Modeling by PRISM
Taisuke Sato
Tokyo Institute of Technology / CREST, JST
2-12-1 Ookayama, Meguro-ku, Tokyo, Japan 152-8552

Neng-Fa Zhou
The City University of New York
2900 Bedford Avenue, Brooklyn, NY 11210-2889

Abstract

PRISM was born in 1997 as a symbolic statistical modeling language to facilitate modeling complex systems governed by rules and probabilities [Sato and Kameya, 1997]. It was the first programming language with EM learning ability, and it has been shown to cover popular symbolic statistical models such as Bayesian networks, HMMs (hidden Markov models) and PCFGs (probabilistic context free grammars) [Sato and Kameya, 2001]. Last year, we entirely reimplemented PRISM based on a new tabling mechanism of B-Prolog [Zhou and Sato, 2002]. As a result, we can now deal with much larger data sets and more complex models. In this paper, we focus on this recent development and report two modeling examples in statistical natural language processing. One is a declarative PDCG (probabilistic definite clause grammar) program which simulates top-down parsing. The other is a left-corner parsing program which describes a bottom-up parsing that manipulates a stack. The fact that these rather different types of modeling and their EM learning are uniformly possible through PRISM programming shows the versatility of PRISM. (This paper is partly based on [Sato and Motomura, 2002].)

1 Introduction

PRISM (http://sato-www.cs.titech.ac.jp/prism/index.html) was born in 1997 as a symbolic statistical modeling language to facilitate modeling complex systems governed by rules and probabilities [Sato and Kameya, 1997; 2001]. The basic idea is to incorporate into logic programs a statistical learning mechanism for embedded parameters. The result is a unique programming language for symbolic statistical modeling. It was the first programming language with EM learning ability and has been shown to cover popular symbolic statistical models. Theoretically it is an embodiment of Turing machines with learning ability, but the real consequence is that it enables us to build arbitrarily complex symbolic statistical models that may go beyond existing statistical models.

PRISM's power comes from three independent yet interrelated ingredients:

• a firm mathematical semantics (the distribution semantics) [Sato, 1995]
• all-solution search using memoizing (OLDT [Tamaki and Sato, 1986] and linear tabling [Zhou and Sato, 2002])
• EM learning of parameters embedded in a program by the graphical EM algorithm [Kameya and Sato, 2000]

We will not go into the details of each ingredient, but PRISM has proved to cover the most popular statistical models, such as HMMs (hidden Markov models) [Rabiner, 1989; Rabiner and Juang, 1993], PCFGs (probabilistic context free grammars) [Wetherell, 1980; Manning and Schütze, 1999] and Bayesian networks [Pearl, 1988; Castillo et al., 1997], with the same time complexity [Sato and Kameya, 2001]. Moreover, we have experimentally confirmed that the learning speed of the graphical EM algorithm [Kameya and Sato, 2000], the EM algorithm for ML (maximum likelihood) estimation employed in PRISM for parameter learning, outperforms that of the standard Inside-Outside algorithm for PCFGs by two or three orders of magnitude [Sato et al., 2001].

From the viewpoint of statistical modeling, one of the significant achievements of PRISM is the elimination of the need for deriving new EM algorithms for new applications. When a user constructs a statistical model with hidden variables, all he or she needs to do is write a PRISM program using probabilistic built-ins such as the msw/2 predicate, which represents a parameterized random switch. The remaining work, namely parameter estimation (learning), is taken care of by the graphical EM algorithm quite efficiently thanks to dynamic programming.
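To make this division of labor concrete, the following is a minimal sketch of such a program (the switch name coin and the predicate direction/1 are illustrative choices, not taken from this paper, and auxiliary declarations vary across PRISM versions):

    % Declare a parameterized random switch named coin whose
    % outcomes are head and tail; the probabilities of these
    % outcomes are the parameters to be learned.
    values(coin,[head,tail]).

    % A person turns left if the coin shows head, right otherwise.
    direction(left)  :- msw(coin,head).
    direction(right) :- msw(coin,tail).

Given a list of observed goals, a query such as learn([direction(left),direction(left),direction(right)]) estimates the probability of coin showing head by EM without any model-specific estimation code, and prob(direction(left),P) then computes the probability of direction(left) under the learned parameters.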
Furthermore, as long as certain modeling principles are observed, it is mathematically assured that the program performs EM learning correctly (this is not self-evident when the model gets complicated). One may say that PRISM is a generic tool for ubiquitous EM learning.

The development of PRISM was gradual because we attempted to fulfill two rather conflicting requirements: exploiting the generality of the semantics and achieving reasonable efficiency for real applications. In the end we decided to compromise the generality of the semantics and to assume certain independence conditions on programs; while these conditions somewhat restrict the class of acceptable programs, they greatly simplify probability computations, thereby making fast EM learning possible.

Our EM learning consists of two phases. In the first, preprocessing phase, all solutions for a given goal are searched for with respect to the program, yielding a hierarchical graph called an explanation graph (support graph). In the second, learning phase, we run the graphical EM algorithm on the explanation graph to train the parameters in the program. The graphical EM algorithm is efficient in the sense that each of its iterations runs in time linear in the size of the explanation graph [Sato and Kameya, 2001]. In this learning scheme, compared to the efficiency of the graphical EM algorithm in the learning phase, the all-solution search in the preprocessing phase can be a bottleneck: a naive search by backtracking would take exponential time. The key technology for efficiency is memoizing, i.e. tabling the calls and returns of predicates for later reuse, which often reduces exponential time complexity to polynomial time complexity.
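To see the effect of memoizing in a concrete case, consider a sketch of the standard HMM program (the two-state, two-symbol setting and the switch names init, out/1 and tr/1 are illustrative):

    values(init,[s0,s1]).     % choice of the initial state
    values(out(_),[a,b]).     % symbol emission in a given state
    values(tr(_),[s0,s1]).    % state transition from a given state

    % hmm(Cs) holds if the HMM generates the string Cs.
    hmm(Cs) :- msw(init,S), hmm(Cs,S).
    hmm([],_).
    hmm([C|Cs],S) :-
        msw(out(S),C),        % emit one symbol in state S
        msw(tr(S),Next),      % choose the next state
        hmm(Cs,Next).

For an observed string of length n there are 2^n state paths, so enumerating all explanations of an hmm/1 goal by naive backtracking takes exponential time. With tabling, each distinct call hmm(Cs,S) is solved once and reused, so the explanation graph has only O(n) distinct nodes, and each iteration of the graphical EM algorithm is then linear in that size.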
However, the early versions of PRISM were built on top of SICStus Prolog, and it was practically impossible to directly incorporate a full tabling mechanism. Last year, we replaced the underlying Prolog with B-Prolog and reimplemented PRISM with a full linear tabling mechanism [Zhou and Sato, 2002]. As a result, we can now deal with much larger data sets and more complex models. In this paper, we focus on this recent development and report two modeling examples in statistical natural language processing. One is a declarative PDCG (probabilistic definite clause grammar) program which simulates top-down parsing. The other is a left-corner parsing program which procedurally describes a bottom-up parsing that manipulates a stack. The fact that these rather different types of modeling and their EM learning are uniformly possible through PRISM programming shows the versatility of PRISM.

2 Preliminaries

2.1 A quick review of PRISM

PRISM is a probabilistic extension of Prolog [Sterling and Shapiro, 1986]. A Prolog program is a set of logical formulas called definite clauses, which take the form H :- B1,...,Bk (k ≥ 0). H is an atom called the head, and B1,...,Bk is a conjunction of atoms called the body. The clause says that if B1 and ... and Bk hold, then H holds (declarative reading). In the context of top-down computation, however, it should be read as: to achieve the goal H, achieve the subgoals B1 and ... and Bk (procedural reading). This twofold reading, i.e. bottom-up declarative reading vs. top-down procedural reading, makes it possible to write declarative but executable programs that encode both declarative and procedural knowledge in a unified way. When k = 0, the clause is called a unit clause.

A PRISM program DB is a set of definite clauses. We write it as DB = F ∪ R, where F is a set of facts (unit clauses) and R is a set of rules (non-unit clauses), to emphasize the difference in role between facts and rules. One of the unique features of PRISM is that F has a basic joint probability distribution PF. Put differently, the truth of the ground unit clauses A1, A2, ... in F is probabilistic, and their statistical behavior is specified by PF. Here we consider ground unit clauses as random variables taking on 1 (true) or 0 (false). The distribution semantics [Sato, 1995] then guarantees the existence of a unique probability measure, treating every ground atom as a binary random variable.

What distinguishes our approach from existing approaches to probabilistic semantics is that our semantics admits infinite domains and allows us to use infinitely many random variables (probabilistic ground atoms). Consequently, we need not make a distinction between Bayesian networks, where a finite number of random variables appear, and PCFGs, where a countably infinite number of random variables are required: they are just two classes of PRISM programs. Another consequence is that we can implement a variety of EM algorithms as PRISM programs, as long as the programs express, roughly speaking, Turing machines with probabilistic choices.

2.2 Grass wet example

To put the idea of PRISM across, we show a propositional PRISM program DBp = Rp ∪ Fp in Figure 1. It does not include any first-order features of PRISM such as logical variables and function symbols.

    Rp = { g_wet :- s_on.
           g_wet :- s_off, w_rain.
           g_dry :- s_off, w_clear. }

    Fp = { s_on.  s_off.  w_rain.  w_clear. }

    Figure 1: Wet grass program DBp

Rp expresses our causal knowledge of the events represented by six propositions: g_wet ("the grass is wet"), g_dry ("the grass is dry"), s_on ("the sprinkler is on"), s_off ("the sprinkler is off"), w_rain ("it is rainy") and w_clear ("it is clear"). The first clause says the grass is wet if the sprinkler is on. The second clause says the grass is also wet if the sprinkler is off but the weather is rainy. The last clause says the grass is dry if the sprinkler is off and the weather is clear. We assume these rules hold without uncertainty.

In addition to the causal knowledge described above, we know that the states of the weather and the sprinkler are probabilistic.
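The probabilistic treatment of Fp is developed in the sequel; as a preview, one way to realize it in executable PRISM is to let two random switches decide the states of the sprinkler and the weather (a sketch; the switch names sprinkler and weather are illustrative):

    values(sprinkler,[on,off]).
    values(weather,[rain,clear]).

    % Each fact of Fp holds exactly when the corresponding
    % switch takes the corresponding value.
    s_on    :- msw(sprinkler,on).
    s_off   :- msw(sprinkler,off).
    w_rain  :- msw(weather,rain).
    w_clear :- msw(weather,clear).

    % The rules Rp are carried over unchanged.
    g_wet :- s_on.
    g_wet :- s_off, w_rain.
    g_dry :- s_off, w_clear.

If the parameters are fixed by set_sw(sprinkler,[0.6,0.4]) and set_sw(weather,[0.3,0.7]), then prob(g_wet,P) gives P = 0.6 + 0.4 * 0.3 = 0.72, the sum over the two mutually exclusive explanations of g_wet; conversely, learn/1 can estimate the switch parameters from observations of g_wet and g_dry.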