
A Tutorial on Inference and Learning in Bayesian Networks

Irina Rish

IBM T.J. Watson Research Center
[email protected]
http://www.research.ibm.com/people/r/rish/

Outline

 Motivation: learning probabilistic models from data
 Representation: Bayesian network models
 Probabilistic inference in Bayesian Networks
   Exact inference
   Approximate inference
 Learning Bayesian Networks

 Learning parameters

 Learning graph structure (model selection)
 Summary

Bayesian Networks

Structured, graphical representation of probabilistic relationships between several random variables.
Explicit representation of conditional independencies: missing arcs encode independence assumptions.
Efficient representation of the joint PDF P(X) (not just discriminative): allows arbitrary queries to be answered, e.g.

P(lung cancer=yes | smoking=no, positive X-ray=yes) = ?

Bayesian Network: BN = (G, Θ)

G – directed acyclic graph (DAG): nodes – random variables, edges – direct dependencies.
Θ – set of parameters in all conditional probability distributions (CPDs).
CPD of node X: P(X | parents(X)); in the Smoking / lung Cancer / Bronchitis / X-ray / Dyspnoea network the CPDs are P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).

Example CPD P(D|C,B):
C B | D=0 D=1
0 0 | 0.1  0.9
0 1 | 0.7  0.3
1 0 | 0.8  0.2
1 1 | 0.9  0.1

Compact representation of joint distribution in a product form (chain rule): P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
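As a rough illustration (not part of the original slides), the sketch below stores each CPD as a small Python dictionary and evaluates the joint P(S,C,B,X,D) via the chain rule above; the P(D|C,B) numbers come from the example table, all other CPT values are made-up placeholders.

```python
# Minimal sketch: a Bayesian network as a dictionary of CPTs (binary variables).
# Each CPT maps (parent values...) -> probability that the child equals 1.
p_S = 0.3                                    # hypothetical P(S=1)
p_C_given_S = {0: 0.01, 1: 0.10}             # hypothetical P(C=1 | S)
p_B_given_S = {0: 0.05, 1: 0.30}             # hypothetical P(B=1 | S)
p_X_given_CS = {(0, 0): 0.02, (0, 1): 0.05,  # hypothetical P(X=1 | C, S)
                (1, 0): 0.90, (1, 1): 0.95}
p_D_given_CB = {(0, 0): 0.9, (0, 1): 0.3,    # P(D=1 | C, B) from the CPT above
                (1, 0): 0.2, (1, 1): 0.1}

def bernoulli(p1, value):
    """Return P(value) for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1

def joint(s, c, b, x, d):
    """Chain rule: P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return (bernoulli(p_S, s) *
            bernoulli(p_C_given_S[s], c) *
            bernoulli(p_B_given_S[s], b) *
            bernoulli(p_X_given_CS[(c, s)], x) *
            bernoulli(p_D_given_CB[(c, b)], d))

# The 13 numbers above specify the full joint over 2^5 = 32 assignments:
print(joint(s=0, c=0, b=1, x=0, d=1))
```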

1 + 2 + 2 + 4 + 4 = 13 parameters instead of 2^5 = 32

Example: Printer Troubleshooting

[Figure: printer troubleshooting network over 26 variables, e.g. Application Output OK, Print Spooling On, Spooled Data OK, GDI Data Input OK, Local Disk Space Adequate, Correct Driver, Correct Driver Settings, Correct Printer Selected, Net/Local Printing, PC to Printer Transport OK, Local Cable Connected, Net Cable Connected, Paper Loaded, Printer On and Online, Printer Memory Adequate, Print Output OK, ...]

26 variables: instead of 2^26 parameters we get 99 = 17x1 + 1x2^1 + 2x2^2 + 3x2^3 + 3x2^4

[Heckerman, 95]

“Moral” graph of a BN

Moralization algorithm:
1. Connect (“marry”) the parents of each node.
2. Drop the directionality of the edges.

The resulting undirected graph is called the “moral” graph of the BN.

Interpretation: every pair of nodes that occur together in a CPD is connected by an edge in the moral graph.

A CPD for X and its k parents (called a “family”) is represented by a clique of size (k+1) in the moral graph, and contains d^k (d − 1) probability parameters, where d is the number of values each variable can have (domain size).

Conditional Independence in BNs: three types of connections

Serial (intermediate cause): Visit to Asia → Tuberculosis → Chest X-ray; knowing T makes A and X independent.
Diverging (common cause): Lung Cancer ← Smoking → Bronchitis; knowing S makes L and B independent.
Converging (common effect): Lung Cancer → Dyspnoea ← Bronchitis (with descendant Running Marathon); NOT knowing D or M makes L and B independent.

d-separation

Nodes X and Y are d-separated if on any (undirected) path between X and Y there is some variable Z such that either Z is in a serial or diverging connection and Z is known, or Z is in a converging connection and neither Z nor any of Z’s descendants are known.


Nodes X and Y are called d-connected if they are not d-separated (there exists an undirected path between X and Y that is not d-separated by any node or set of nodes).
If nodes X and Y are d-separated by Z, then X and Y are conditionally independent given Z (see Pearl, 1988). A small sketch of a d-separation test appears below.
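As a hedged illustration (not from the original slides), here is a minimal Python sketch of the standard d-separation test via the ancestral moral graph: restrict to ancestors of X, Y and Z, moralize, delete Z, and check connectivity.

```python
from collections import deque

def ancestors_of(nodes, parents):
    """All ancestors of a set of nodes (including the nodes themselves)."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(x, y, given, parents):
    """True iff x and y are d-separated by the set `given` in the DAG `parents`."""
    keep = ancestors_of({x, y} | set(given), parents)
    # Build the moral graph restricted to the ancestral set.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = [p for p in parents.get(child, []) if p in keep]
        for p in ps:                       # child-parent edges
            adj[child].add(p); adj[p].add(child)
        for i in range(len(ps)):           # "marry" the parents
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # Remove the conditioning nodes and test reachability from x to y.
    blocked = set(given)
    seen, queue = {x}, deque([x])
    while queue:
        v = queue.popleft()
        if v == y:
            return False                   # connected => d-connected
        for w in adj[v]:
            if w not in seen and w not in blocked:
                seen.add(w); queue.append(w)
    return True

# The lung-cancer example network (C, B <- S; X <- C,S; D <- C,B; M <- D):
parents = {'C': ['S'], 'B': ['S'], 'X': ['C', 'S'], 'D': ['C', 'B'], 'M': ['D']}
print(d_separated('C', 'B', ['S'], parents))       # True: blocked by the common cause S
print(d_separated('C', 'B', ['S', 'D'], parents))  # False: knowing D unblocks the converging path
```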

Independence Relations in BN

A variable (node) is conditionally independent of its non-descendants given its parents.

[Figure: the Smoking / Lung Cancer / Bronchitis / X-ray / Dyspnoea / Running Marathon network]

Example: given Bronchitis and Lung Cancer, Dyspnoea is independent of X-ray (but may still depend on Running Marathon).

Markov Blanket

A node is conditionally independent of ALL other nodes given its Markov blanket, i.e. its parents, children, and “spouses” (parents of common children). (Proof left as a homework problem ☺)
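As a small illustrative sketch (not from the slides), the Markov blanket can be read off directly from the parent lists of a DAG:

```python
def markov_blanket(node, parents):
    """Parents, children and spouses (co-parents) of `node` in a DAG,
    where `parents` maps each child to the list of its parents."""
    blanket = set(parents.get(node, []))                 # parents
    for child, ps in parents.items():
        if node in ps:
            blanket.add(child)                           # children
            blanket.update(p for p in ps if p != node)   # spouses
    return blanket

# Smoking / lung Cancer / Bronchitis / X-ray / Dyspnoea / Running Marathon network:
parents = {'C': ['S'], 'B': ['S'], 'X': ['C', 'S'], 'D': ['C', 'B'], 'M': ['D']}
print(markov_blanket('C', parents))   # {'S', 'X', 'D', 'B'}
```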

[Figure: example network over Age, Gender, Exposure to Toxins, Smoking, Diet, Cancer, Serum Calcium, Lung Tumor – from Breese & Koller, 97]

What are BNs useful for?

 Diagnosis: P(cause|symptom) = ?
 Prediction: P(symptom|cause) = ?
 Classification: max_class P(class|data)
 Decision-making (given a cost function)

Application areas: medicine, speech recognition, bio-informatics, text classification, stock market, computer troubleshooting.

Application Examples

APRI system developed at AT&T Bell Labs: learns & uses Bayesian networks from data to identify customers liable to default on bill payments.
NASA Vista system: predicts failures in propulsion systems, considers time criticality & suggests the highest-utility action, dynamically decides what information to show.

Application Examples

Office Assistant in MS Office 97 / MS Office 95: extension of the Answer Wizard; uses naïve Bayesian networks; help is based on past experience (keyboard/mouse use) and the task the user is currently doing. This is the “smiley face” you get in your MS Office applications.
Microsoft Pregnancy and Child-Care: available on MSN in the Health section; frequently occurring children’s symptoms are linked to expert modules that repeatedly ask parents relevant questions; asks the next best question based on the provided information; presents articles deemed relevant based on the information provided.

IBM’s systems management applications (Systems @ Watson)

End-user transaction recognition (Hellerstein, Jayram, Rish (2000)): recognizing transactions (e.g. BUY?, SELL?, OPEN_DB?, SEARCH?) from observed Remote Procedure Calls (RPCs).
Fault diagnosis using probes (Rish, Brodie, Ma (2001)): diagnosing software or hardware components from probe outcomes.
Issues:
 Efficiency (scalability)
 Missing data/noise: sensitivity analysis
 “Adaptive” probing: selecting “most-informative” probes, on-line learning/model updates, on-line diagnosis
Goal: finding the most-likely diagnosis:
(x1*, …, xn*) = arg max_{x1,…,xn} P(x1, …, xn | t1, …, tn)

Pattern discovery, classification, diagnosis and prediction

Probabilistic Inference Tasks

 Belief updating: BEL(X_i) = P(X_i = x_i | evidence)

 Finding the most probable explanation (MPE): x* = arg max_x P(x, e)

 Finding the maximum a-posteriori hypothesis: (a1*, …, ak*) = arg max_a Σ_{X/A} P(x, e), where A ⊆ X are the hypothesis variables
 Finding the maximum-expected-utility (MEU) decision: (d1*, …, dk*) = arg max_d Σ_{X/D} P(x, e) U(x), where D ⊆ X are the decision variables and U(x) is a utility function

Belief Updating Task: Example

[Figure: the Smoking / lung Cancer / Bronchitis / X-ray / Dyspnoea network]

P(smoking | dyspnoea=yes) = ?

Belief updating: find P(X|evidence)

P(s|d=1) = P(s, d=1) / P(d=1) ∝ P(s, d=1) = Σ_{d=1,b,x,c} P(s) P(c|s) P(b|s) P(x|c,s) P(d|c,b)
         = P(s) Σ_{d=1} Σ_b P(b|s) Σ_x Σ_c P(c|s) P(x|c,s) P(d|c,b)

Pushing the sums inside the products creates intermediate functions such as f(s, d, b, x); on the “moral” graph of this network W* = 4.
Complexity: O(n exp(w*)), where w* is the “induced width” (max induced clique size).
Efficient inference: variable orderings, conditioning, approximations. A small numeric sketch of this elimination follows.
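A minimal sketch (not from the slides) of this computation in Python, reusing the chain-rule CPTs from earlier; all values except P(D|C,B) are hypothetical placeholders. It evaluates P(s, d=1) by pushing the sums inward in the order shown above and then normalizes over s:

```python
# Hypothetical CPTs; P(D=1|C,B) follows the example table given earlier.
p_S = {1: 0.3, 0: 0.7}
p_C_given_S = {(1, 1): 0.10, (0, 1): 0.90, (1, 0): 0.01, (0, 0): 0.99}   # P(c|s)
p_B_given_S = {(1, 1): 0.30, (0, 1): 0.70, (1, 0): 0.05, (0, 0): 0.95}   # P(b|s)
p_D1_given_CB = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}     # P(d=1|c,b)
# X is not needed here: sum_x P(x|c,s) = 1, so it drops out of the sums.

def p_s_and_d1(s):
    """P(s, d=1) = P(s) * sum_b P(b|s) * sum_c P(c|s) * P(d=1|c,b)."""
    total = 0.0
    for b in (0, 1):
        inner = sum(p_C_given_S[(c, s)] * p_D1_given_CB[(c, b)] for c in (0, 1))
        total += p_B_given_S[(b, s)] * inner
    return p_S[s] * total

unnormalized = {s: p_s_and_d1(s) for s in (0, 1)}
z = sum(unnormalized.values())                    # z = P(d=1)
posterior = {s: v / z for s, v in unnormalized.items()}
print(posterior)                                  # P(S | d=1)
```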

Variable elimination algorithms (also called “bucket-elimination”)

Belief updating: VE-algorithm elim-bel (Dechter 1996)

Elimination operator: Σ_b Π (sum the bucket’s variable out of the product of its functions)
bucket B: P(b|a), P(d|b,a), P(e|b,c)  →  h^B(a, d, c, e)
bucket C: P(c|a), h^B(a, d, c, e)     →  h^C(a, d, e)
bucket D: h^C(a, d, e)                →  h^D(a, e)
bucket E: e=0, h^D(a, e)              →  h^E(a)
bucket A: P(a), h^E(a)                →  P(a|e=0)
W* = 4: “induced width” (max clique size)

Finding MPE = max_x P(x): VE-algorithm elim-mpe (Dechter 1996)

Σ is replaced by max:
MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|a,b) P(e|b,c)

Elimination operator: max_b Π
bucket B: P(b|a), P(d|b,a), P(e|b,c)  →  h^B(a, d, c, e)
bucket C: P(c|a), h^B(a, d, c, e)     →  h^C(a, d, e)
bucket D: h^C(a, d, e)                →  h^D(a, e)
bucket E: e=0, h^D(a, e)              →  h^E(a)
bucket A: P(a), h^E(a)                →  MPE probability
W* = 4: “induced width” (max clique size)

Generating the MPE-solution

Assign values in the reverse order of elimination, consulting the functions in each bucket:
1. a' = arg max_a P(a) · h^E(a)
2. e' = 0
3. d' = arg max_d h^C(a', d, e')
4. c' = arg max_c P(c|a') · h^B(a', d', c, e')
5. b' = arg max_b P(b|a') · P(d'|b, a') · P(e'|b, c')
Return (a', b', c', d', e')

Complexity of VE-inference: O(n exp(w*_o))
w*_o – the induced width of the moral graph along ordering o
Meaning: w*_o + 1 = size of the largest clique created during inference

The width w_o(X) of a variable X in graph G along the ordering o is the number of nodes preceding X in the ordering and connected to X (its earlier neighbors).

The width w_o of a graph is the maximum width w_o(X) among all nodes. The induced graph G' along the ordering o is obtained by recursively connecting the earlier neighbors of each node, from the last to the first node in the ordering. The width of the induced graph G' is called the induced width of the graph G (denoted w*_o).

[Figure: the moral graph of the A,B,C,D,E example and two orderings o1, o2 with w*_{o1} = 4 and w*_{o2} = 2]

Ordering is important! But finding the min-w* ordering is NP-hard… Inference is also NP-hard in the general case [Cooper]. A small sketch of computing w*_o for a given ordering follows.
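As a hedged illustration (not part of the slides), computing the induced width of an undirected (moral) graph along a given elimination ordering by connecting earlier neighbors from last to first; the edge set below is the moral graph of the A-E example used in the bucket-elimination slides, and the two orderings are plausible reconstructions of o1 and o2:

```python
from itertools import combinations

def induced_width(edges, ordering):
    """Induced width w*_o of an undirected graph along `ordering`
    (ordering[0] is eliminated last, matching the bucket order)."""
    pos = {v: i for i, v in enumerate(ordering)}
    adj = {v: set() for v in ordering}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    width = 0
    for v in reversed(ordering):                       # from last to first
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        width = max(width, len(earlier))
        for a, b in combinations(earlier, 2):          # connect earlier neighbors
            adj[a].add(b); adj[b].add(a)
    return width

# Moral graph of P(a)P(c|a)P(b|a)P(d|a,b)P(e|b,c): family edges plus married parents.
edges = [('A','B'), ('A','C'), ('A','D'), ('B','D'), ('B','E'), ('C','E'), ('B','C')]
print(induced_width(edges, ['A', 'E', 'D', 'C', 'B']))   # 4, as in the W*=4 example
print(induced_width(edges, ['A', 'B', 'C', 'D', 'E']))   # 2, a better ordering
```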

Learning Bayesian Networks

 Combining domain expert knowledge with data (samples such as <9.7 0.6 8 14 18>, <0.2 1.3 5 ?? ??>, <1.3 2.8 ?? 0 1>, …)

 Efficient representation and inference

 Incremental learning: updating the prior P(H) as new data arrives

 Handling missing data: <1.3 2.8 ?? 0 1 >

 Learning causal relationships: S → C

Learning tasks: four main cases

 Known graph – learn parameters
   Complete data: parameter estimation (ML, MAP)
   Incomplete data: non-linear parametric optimization (gradient descent, EM)
 Unknown graph – learn graph and parameters
   Complete data: optimization (search in the space of graphs), Ĝ = arg max_G Score(G)
   Incomplete data: structural EM, mixture models

Learning Parameters: complete data (overview)

 ML-estimate: max_Θ log P(D | Θ) – decomposable!

θ_{x,pa_X} = P(x | pa_X); with multinomial counts N_{x,pa_X}:
ML(θ_{x,pa_X}) = N_{x,pa_X} / Σ_x N_{x,pa_X}

 MAP-estimate (Bayesian): max_Θ log [ P(D | Θ) P(Θ) ]

Conjugate priors – Dirichlet: Dir(θ_{pa_X} | α_{1,pa_X}, …, α_{m,pa_X})
MAP(θ_{x,pa_X}) = (N_{x,pa_X} + α_{x,pa_X}) / (Σ_x N_{x,pa_X} + Σ_x α_{x,pa_X})
The equivalent sample size Σ_x α_{x,pa_X} encodes prior knowledge.

Learning Parameters (details)
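As a minimal sketch of these formulas (not the slides' original content), estimating a CPT P(X | pa_X) from complete data by counting, with an optional symmetric Dirichlet pseudo-count α for the MAP estimate:

```python
from collections import Counter

def estimate_cpt(data, child, parents, child_values, alpha=0.0):
    """ML (alpha=0) or MAP (alpha>0, symmetric Dirichlet) estimate of
    P(child | parents) from a list of fully observed records (dicts)."""
    joint, marginal = Counter(), Counter()
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[(record[child], pa)] += 1
        marginal[pa] += 1
    cpt = {}
    for pa in marginal:
        denom = marginal[pa] + alpha * len(child_values)
        for x in child_values:
            cpt[(x, pa)] = (joint[(x, pa)] + alpha) / denom
    return cpt

# Example: estimate P(D | C, B) from a few hypothetical complete records.
data = [{'C': 0, 'B': 0, 'D': 1}, {'C': 0, 'B': 0, 'D': 1},
        {'C': 1, 'B': 1, 'D': 0}, {'C': 1, 'B': 1, 'D': 1}]
print(estimate_cpt(data, 'D', ['C', 'B'], [0, 1]))           # ML from counts
print(estimate_cpt(data, 'D', ['C', 'B'], [0, 1], alpha=1))  # MAP with Dirichlet(1,...,1)
```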

Learning Parameters: incomplete data

Non-decomposable marginal likelihood (hidden nodes)

EM-algorithm (start from initial parameters, iterate until convergence):
 Expectation: with the current model (G, Θ), compute EXPECTED counts via inference in the BN, e.g. from records with missing values such as <1 1 ? 0 1>, <0 0 0 ? ?>, …
   E_{P(X)}[N_{x,pa_X}] = Σ_{k=1}^{N} P(x, pa_X | y^k, Θ, G)
 Maximization: update the parameters Θ (ML, MAP) from the expected counts (a toy sketch follows below).
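A toy sketch (assumed setup, not from the slides) of this EM loop for the simplest case: a two-node model H → X with binary variables, where H is missing in some records; the E-step computes P(H | x) by Bayes rule and the M-step re-estimates P(H) and P(X | H) from the expected counts:

```python
def em_missing_h(records, iters=50):
    """EM for the two-node model H -> X (both binary), where H is missing (None)
    in some records. records: list of (h, x) pairs. Returns P(H=1), [P(X=1|H=0), P(X=1|H=1)]."""
    p_h1, p_x1_h = 0.5, [0.3, 0.8]                     # arbitrary initial parameters
    for _ in range(iters):
        n_h1, n_x1 = 0.0, [0.0, 0.0]                   # expected counts
        for h, x in records:
            if h is None:                              # E-step: P(H=1 | x) by Bayes rule
                lik1 = p_h1 * (p_x1_h[1] if x == 1 else 1 - p_x1_h[1])
                lik0 = (1 - p_h1) * (p_x1_h[0] if x == 1 else 1 - p_x1_h[0])
                w1 = lik1 / (lik1 + lik0)
            else:
                w1 = float(h)                          # H observed: hard count
            n_h1 += w1
            if x == 1:
                n_x1[1] += w1
                n_x1[0] += 1 - w1
        n = len(records)                               # M-step: ML from expected counts
        p_h1 = n_h1 / n
        p_x1_h = [n_x1[0] / (n - n_h1), n_x1[1] / n_h1]
    return p_h1, p_x1_h

# H is missing in half of the (hypothetical) records:
data = [(1, 1), (1, 1), (0, 0), (0, 1), (None, 1), (None, 0), (None, 1), (None, 1)]
print(em_missing_h(data))
```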

Learning graph structure

Find Ĝ = arg max_G Score(G) – an NP-hard optimization problem.
 Heuristic search in the space of graphs, using local moves such as adding, deleting, or reversing an edge (e.g. S→B):
   Complete data – the score decomposes into local computations
   Incomplete data (score non-decomposable) – stochastic methods
   Local greedy search; K2 algorithm
 Constraint-based methods (PC/IC algorithms):
   Data impose independence relations (constraints) on the graph structure

Scoring function: Minimum Description Length (MDL)

 Learning ↔ data compression

Given data D (samples such as <9.7 0.6 8 14 18>, <0.2 1.3 5 ?? ??>, <1.3 2.8 ?? 0 1>, …):

MDL(BN | D) = − log P(D | Θ, G) + (|Θ| / 2) log N

The first term is DL(Data | model), the second is DL(Model).
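A rough sketch (assumptions: complete binary data and ML parameters fitted by counting; not from the slides) of computing this MDL-style score for a candidate structure:

```python
from collections import Counter
from math import log

def mdl_score(data, structure):
    """MDL(BN | D) = -log P(D | Theta_ML, G) + (|Theta|/2) * log N,
    for binary variables; `structure` maps each variable to its parent list."""
    n_records = len(data)
    counts, pa_counts = Counter(), Counter()
    for r in data:
        for child, parents in structure.items():
            pa = tuple(r[p] for p in parents)
            counts[(child, r[child], pa)] += 1
            pa_counts[(child, pa)] += 1
    loglik = 0.0
    for (child, x, pa), c in counts.items():
        loglik += c * log(c / pa_counts[(child, pa)])   # ML estimate of P(x | pa)
    n_params = sum((2 - 1) * 2 ** len(ps) for ps in structure.values())
    return -loglik + 0.5 * n_params * log(n_records)

# Compare two hypothetical structures on hypothetical binary data (lower is better):
data = [{'S': 1, 'C': 1, 'B': 0}, {'S': 0, 'C': 0, 'B': 0},
        {'S': 1, 'C': 0, 'B': 1}, {'S': 0, 'C': 0, 'B': 1}]
print(mdl_score(data, {'S': [], 'C': ['S'], 'B': ['S']}))   # S -> C, S -> B
print(mdl_score(data, {'S': [], 'C': [], 'B': []}))         # empty graph
```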

 Other scores: MDL = −BIC (Bayesian Information Criterion); the Bayesian score (BDe) is asymptotically equivalent to MDL.

Model selection trade-offs

Naïve Bayes – too simple (fewer parameters, but a bad model) vs. unrestricted BN – too complex (possible overfitting + complexity).

[Figure: Naïve Bayes – a Class node with conditionally independent features f1, …, fn and CPDs P(fi | class); unrestricted BN – additional edges among the features]

Various approximations between the two extremes

TAN: tree-augmented Naïve Bayes [Friedman et al. 1997] – based on the Chow-Liu (CL) method for learning trees [Chow and Liu, 1968].

Tree-structured distributions

A joint distribution is tree-structured if it can be written as
P(x) = Π_{i=1}^{n} P(x_i | x_{j(i)})
where x_{j(i)} is the parent of x_i in a Bayesian network for P(x) that is a directed tree.

Example: P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|C) P(E|B) is a tree (with root A); a graph with an (undirected) cycle is not a tree.
A tree requires only (d−1) + d(d−1)(n−1) parameters, where d is the domain size.
Moreover, inference in trees is O(n) (linear) since their w* = 1.

Approximations by trees

[Figure: a true distribution P(X) over A, B, C, D, E and a tree-approximation P’(X)]

How good is the approximation? Use cross-entropy (KL-divergence):

D(P, P’) = Σ_x P(x) log [ P(x) / P’(x) ]

D(P,P’) is non-negative, and D(P,P’) = 0 if and only if P coincides with P’ (on a set of measure 1). How do we find the best tree-approximation? A small numeric sketch of the divergence follows.
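A minimal sketch (hypothetical numbers) of this KL-divergence for discrete distributions represented as dictionaries:

```python
from math import log

def kl_divergence(p, q):
    """D(P, Q) = sum_x P(x) log(P(x)/Q(x)) for discrete distributions given as
    dicts mapping outcomes to probabilities (assumes q[x] > 0 wherever p[x] > 0)."""
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical joint over two binary variables vs. an independent approximation:
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
q = {(a, b): 0.5 * 0.5 for a in (0, 1) for b in (0, 1)}   # product of the marginals
print(kl_divergence(p, q))    # > 0: the approximation loses information
print(kl_divergence(p, p))    # 0.0
```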

Optimal trees: Chow-Liu result

 Lemma. Given a joint PDF P(x) and a fixed tree structure T, the best approximation P’(x) (i.e., the P’(x) that minimizes D(P, P’)) satisfies
P’(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) for all i = 1, …, n.
Such a P’(x) is called the projection of P(x) on T.

 Theorem [Chow and Liu, 1968]. Given a joint PDF P(x), the KL-divergence D(P, P’) is minimized by projecting P(x) on a maximum-weight spanning tree (MWST) over the nodes in X, where the weight on the edge (X_i, X_j) is the mutual information
I(X_i; X_j) = Σ_{x_i, x_j} P(x_i, x_j) log [ P(x_i, x_j) / (P(x_i) P(x_j)) ]

Note that I(X;Y) = 0 when X and Y are independent, and that I(X;Y) = D(P(x,y), P(x)P(y)).

Proofs

Proof of Lemma:
D(P, P’) = − Σ_x P(x) Σ_{i=1}^{n} log P’(x_i | x_{j(i)}) + Σ_x P(x) log P(x)
         = − Σ_x P(x) Σ_{i=1}^{n} log P’(x_i | x_{j(i)}) − H(X)
         = − Σ_{i=1}^{n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) log P’(x_i | x_{j(i)}) − H(X)        (1)
         = − Σ_{i=1}^{n} Σ_{x_{j(i)}} P(x_{j(i)}) Σ_{x_i} P(x_i | x_{j(i)}) log P’(x_i | x_{j(i)}) − H(X)   (2)

A known fact: given P(x), the maximum of Σ_x P(x) log P’(x) is achieved by the choice P’(x) = P(x).

Therefore, for any value of i and x_{j(i)}, the term Σ_{x_i} P(x_i | x_{j(i)}) log P’(x_i | x_{j(i)}) is maximized by choosing P’(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) (and thus the total D(P, P’) is minimized), which proves the Lemma.

Proof of Theorem:
Replacing P’(x_i | x_{j(i)}) = P(x_i | x_{j(i)}) in expression (1) yields
D(P, P’) = − Σ_{i=1}^{n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) log [ P(x_i, x_{j(i)}) / P(x_{j(i)}) ] − H(X)
         = − Σ_{i=1}^{n} Σ_{x_i, x_{j(i)}} P(x_i, x_{j(i)}) [ log ( P(x_i, x_{j(i)}) / (P(x_{j(i)}) P(x_i)) ) + log P(x_i) ] − H(X)
         = − Σ_{i=1}^{n} I(X_i, X_{j(i)}) − Σ_{i=1}^{n} Σ_{x_i} P(x_i) log P(x_i) − H(X).
The last two terms are independent of the choice of the tree, and thus D(P, P’) is minimized by maximizing the sum of edge weights I(X_i, X_{j(i)}).

Chow-Liu algorithm [As presented in Pearl, 1988]

1. From the given distribution P(x) (or from data generated by P(x)), compute the joint pairwise distributions P(x_i, x_j) for all i ≠ j.
2. Using the pairwise distributions from step 1, compute the mutual information I(X_i; X_j) for each pair of nodes and assign it as the weight to the corresponding edge (X_i, X_j).
3. Compute the maximum-weight spanning tree (MWST):
   a. Start from the empty tree over n variables.
   b. Insert the two largest-weight edges.
   c. Find the next largest-weight edge and add it to the tree if no cycle is formed; otherwise, discard the edge and repeat this step.
   d. Repeat step (c) until n−1 edges have been selected (a tree is constructed).
4. Select an arbitrary root node and direct the edges outwards from the root.
5. The tree approximation P’(x) can be computed as a projection of P(x) on the resulting directed tree (using the product form of P’(x)). A sketch of this procedure appears below.
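A compact sketch (hypothetical data, not from the slides) of steps 1-4: estimate pairwise mutual information from samples, build the maximum-weight spanning tree greedily (Kruskal-style), and direct it from an arbitrary root:

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(data, i, j):
    """Empirical I(X_i; X_j) from a list of discrete-valued records (tuples)."""
    n = len(data)
    pij, pi, pj = Counter(), Counter(), Counter()
    for r in data:
        pij[(r[i], r[j])] += 1; pi[r[i]] += 1; pj[r[j]] += 1
    return sum((c / n) * log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_tree(data, n_vars, root=0):
    """Return directed edges (parent, child) of the Chow-Liu tree approximation."""
    weights = {(i, j): mutual_information(data, i, j)
               for i, j in combinations(range(n_vars), 2)}
    comp = list(range(n_vars))          # union-find for cycle detection
    def find(v):
        while comp[v] != v:
            v = comp[v]
        return v
    undirected = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                    # add the heaviest edge that forms no cycle
            comp[ri] = rj
            undirected.append((i, j))
        if len(undirected) == n_vars - 1:
            break
    # Direct the edges outward from the root.
    adj = {v: [] for v in range(n_vars)}
    for i, j in undirected:
        adj[i].append(j); adj[j].append(i)
    directed, visited, frontier = [], {root}, [root]
    while frontier:
        v = frontier.pop()
        for u in adj[v]:
            if u not in visited:
                visited.add(u)
                directed.append((v, u))   # (parent, child)
                frontier.append(u)
    return directed

# Hypothetical binary samples over X0, X1, X2 (X1 and X2 both track X0):
data = [(0, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 1, 1), (1, 0, 1), (0, 0, 1), (1, 1, 0)]
print(chow_liu_tree(data, 3))   # e.g. [(0, 1), (0, 2)]
```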

Summary: Learning and inference in BNs

 Bayesian Networks – graphical probabilistic models
 Efficient representation and inference
 Expert knowledge + learning from data

 Learning:
   parameters (parameter estimation, EM)
   structure (optimization with score functions – e.g., MDL)
 Complexity trade-off:

   NB, BNs and trees
 There is much more: modeling time (DBNs, HMMs), approximate inference, on-line learning, active learning, etc.

Online/print resources on BNs

Conferences & Journals: UAI, ICML, AAAI, AISTAT, KDD; MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI

Books and Papers:
 Bayesian Networks without Tears by Eugene Charniak. AI Magazine, Winter 1991.
 Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann, 1988.
 Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley, 1990.
 CACM special issue on real-world applications of BNs, March 1995.

Online/Print Resources on BNs

AUAI online: www.auai.org. Links to:
 Electronic proceedings for UAI conferences
 Other sites with information on BNs and reasoning under uncertainty
 Several tutorials and important articles
 Research groups & companies working in this area
 Other societies, mailing lists and conferences

Publicly available s/w for BNs

 List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
 Several free packages: generally research only
 Commercial packages: the most powerful (& expensive) is HUGIN; others include Netica and Dxpress
 We are working on developing a Java-based BN toolkit here at Watson