Parameterized Complexity Results for Probabilistic Network Structure Learning

Stefan Szeider Vienna University of Technology, Austria

Workshop on Applications of Parameterized Algorithms and Complexity (APAC 2012), Warwick, UK, July 8, 2012

Joint work with:

Serge Gaspers (UNSW, Sydney), Mikko Koivisto (U Helsinki), Mathieu Liedloff (U d’Orléans), Sebastian Ordyniak (Masaryk U Brno)

Papers:
• Algorithms and Complexity Results for Exact Bayesian Structure Learning. Ordyniak and Szeider. Conference on Uncertainty in Artificial Intelligence (UAI 2010).
• An Improved Dynamic Programming Algorithm for Exact Bayesian Network Structure Learning. Ordyniak and Szeider. NIPS Workshop on Discrete Optimization in Machine Learning (DISCML 2011).
• On Finding Optimal Polytrees. Gaspers, Koivisto, Liedloff, Ordyniak, and Szeider. AAAI Conference on Artificial Intelligence (AAAI 2012).

Probabilistic Networks

• Bayesian Networks (BNs) were introduced by Judea Pearl in 1985 (2011 Turing Award winner)
• A BN is a DAG D = (V,A) plus probability tables associated with the nodes of the network
• In addition to Bayesian networks, other probabilistic networks have also been considered: Markov Random Fields, Factor Graphs, etc.

Example: A Bayesian Network

[Figure: the classic Rain-Sprinkler-Grass Wet Bayesian network. Rain has no parents, Sprinkler has parent Rain, and Grass Wet has parents Sprinkler and Rain; each node carries a conditional probability table, e.g. P(Sprinkler = T | Rain = T) = 0.01 and P(Grass Wet = T | Sprinkler = T, Rain = T) = 0.99.]

Applications

• diagnosis
• computational biology
• document classification
• information retrieval
• image processing
• decision support, etc.

[Figure: a BN showing the main pathophysiological relationships in the diagnostic reasoning focused on a suspected pulmonary embolism event]

Computational Problems

• BN Reasoning: Given a BN, compute the probability of a variable taking a specific value.
• BN Learning: Given a set of sample data, find a BN that fits the data best.
• BN Structure Learning: Given sample data, find the best DAG.
• BN Parameter Learning: Given sample data and a DAG, find the best probability tables.

BN Learning and Local Scores

[Figure: a BN over nodes n1, ..., n6 shown next to a sample-data matrix; each row of the data assigns 0, 1, or * (missing value) to each node.]

Local Score Function: f(n,P) = score of node n with P ⊆ V as its in-neighbors (parents)

Our Combinatorial Model

• BN Structure Learning:
Input: a set N of nodes and a local score function f (given by an explicit listing of its nonzero tuples)
Task: find a DAG D = (N,A) such that the sum of f(n, P_D(n)) over all nodes n is maximum, where P_D(n) denotes the set of parents of n in D.
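To make the model concrete, here is a minimal Python sketch (an illustrative encoding of my own, not code from the papers); the scores are adapted from the example on the DISCML’11 poster:

```python
from itertools import product

# Local score function f, given by an explicit listing of its nonzero
# tuples: (node, frozenset of parents) -> score.
f = {
    ("a", frozenset("d")):   1.0,
    ("a", frozenset("bcd")): 0.5,
    ("b", frozenset("af")):  1.0,
    ("c", frozenset("e")):   1.0,
    ("d", frozenset()):      1.0,
    ("e", frozenset("b")):   1.0,
    ("f", frozenset("d")):   1.0,
    ("g", frozenset("cd")):  1.0,
}
nodes = sorted({n for n, _ in f})

def is_acyclic(parents):
    """Kahn-style check: repeatedly remove nodes all of whose parents are gone."""
    remaining, removed = set(parents), set()
    while remaining:
        sources = {n for n in remaining if parents[n] <= removed}
        if not sources:
            return False  # a directed cycle remains
        removed |= sources
        remaining -= sources
    return True

def score(parents):
    """Sum of f(n, P_D(n)); parent sets not listed in f score 0."""
    return sum(f.get((n, P), 0.0) for n, P in parents.items())

# Brute force over all combinations of listed parent sets (or no parents).
choices = {n: [frozenset()] + [P for m, P in f if m == n] for n in nodes}
best = max(
    (dict(zip(nodes, combo)) for combo in product(*(choices[n] for n in nodes))),
    key=lambda d: score(d) if is_acyclic(d) else float("-inf"),
)
print(score(best), best)
```

Real instances are of course far too large for brute force; the rest of the talk is about exploiting structure instead.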

[Slide: the DISCML’11 poster “An Improved Dynamic Programming Algorithm for Exact Bayesian Network Structure Learning” (Ordyniak and Szeider). Idea: restrict the search to BNs whose skeleton is contained inside an undirected super-structure, run dynamic programming over a tree decomposition of the super-structure (Ordyniak and Szeider, UAI 2010), and prune records using dynamically computed lower and upper bounds; on the Alarm network this cuts running time and memory usage by about a half or more.]

NP-Hardness

Theorem: BN Structure Learning is NP-hard. [Chickering 1996]
• Proof: by reduction from Feedback Arc Set (FAS), NP-hard for maximum degree 4 [Karp 1972].

[Figure: the gadget for an arc (u,v) of the FAS instance, built on nodes u, u’, x, x’, v with local scores f(x,{u}) = 1, f(x’,{u’}) = 1, f(v,{x}) = 1, f(v,{x’}) = 1, f(v,{x,x’}) = 2.]
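For illustration, the gadget’s nonzero tuples written in the explicit-listing format from the combinatorial model (a plain transcription of the figure, not the full reduction):

```python
# Local scores of the FAS gadget for one arc (u, v); the full reduction
# wires one such gadget per arc of the FAS instance.
gadget = {
    ("x",  frozenset({"u"})):       1,
    ("x'", frozenset({"u'"})):      1,
    ("v",  frozenset({"x"})):       1,
    ("v",  frozenset({"x'"})):      1,
    ("v",  frozenset({"x", "x'"})): 2,
}
```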

[Slide: the DISCML’11 poster “Towards Finding Optimal Polytrees” (Gaspers, Koivisto, Liedloff, Ordyniak, Szeider); its results are treated in detail in part 3 of this talk.]

Parameterized Complexity

• Identify a parameter k of the problem
• Aim: solve the problem in time f(k)·n^c (fixed-parameter tractable, FPT)
• XP: solvable in time n^{f(k)}
• W-classes: FPT ⊆ W[1] ⊆ W[2] ⊆ ··· ⊆ XP
(Downey and Fellows: Parameterized Complexity, Springer 1999)

Parameters for BN Structure Learning?

1. Parameterized by the treewidth (of the super structure)
2. Parameterized local search
3. Parameterized by distance from being a branching

1. BN Structure Learning parameterized by the treewidth (of the super structure)

The Super Structure

• The super structure S_f of a local score function f is the undirected graph S_f = (N,E) containing all edges that can participate in an optimal DAG [Perrier, Imoto, Miyano 2008].

[Figure: a local score function, its super structure, and an optimal BN whose skeleton lies inside the super structure.]
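A minimal sketch, assuming the explicit-listing dict representation used earlier: joining each node to every member of each of its listed parent sets gives a valid (though not necessarily smallest) super-structure, since arcs outside it can never increase the score.

```python
# Build an undirected super-structure from the listed nonzero tuples of f.
def super_structure(f):
    edges = set()
    for (n, parents) in f:  # iterate over the listed tuples
        for p in parents:
            edges.add(frozenset({n, p}))
    return edges

# e.g. for the earlier example, the result contains {a,d}, {b,f}, {c,e}, ...
```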

Hardness

Theorem: BN Structure Learning is W[1]-hard when parameterized by the treewidth of the super structure.

Proof: by reduction from Multicolored Clique (MCC).

[Figure: an example MCC instance with k = 3 (vertices v11, ..., v33 in three color classes), the super-structure S_f constructed from it with auxiliary nodes a12, a13, a23, and the BN corresponding to a clique; cf. Figures 7 and 8 of the underlying paper, where G is shown together with an edge set E′ (bold edges) and the resulting DAG D(E′).]

XP-tractability

Theorem: BN Structure Learning is in XP when parameterized by the treewidth of the super structure.

Proof: dynamic programming along a nice tree decomposition. For each node U of the tree decomposition we compute a table T(U).

T(U) gives the highest possible score any partial solution (below U) can achieve
• for a given choice of parents for the nodes in the bag, and
• for a given reachability relation between nodes in the bag via directed paths.

Corollary: If the super-structure has bounded degree, then we get fixed-parameter tractability.
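A minimal sketch of the record notion behind T(U) (my own encoding, not the authors’ data structure): a record fixes a listed parent set for every bag node together with the reachability it induces among bag nodes.

```python
from itertools import product

# Enumerate candidate records for one bag (a list of nodes). The actual
# algorithm stores, per record, the best score of a partial solution below
# the bag and combines tables bottom-up, discarding records that would
# close a directed cycle; here we only generate the candidates.
def bag_records(bag, f):
    listed = {n: [frozenset()] + [P for m, P in f if m == n] for n in bag}
    for combo in product(*(listed[n] for n in bag)):
        parents = dict(zip(bag, combo))
        score = sum(f.get((n, P), 0.0) for n, P in parents.items())
        # direct arcs between bag nodes; a full implementation maintains
        # the transitive closure as the decomposition is processed
        reach = frozenset((p, n) for n, P in parents.items()
                          for p in P if p in bag)
        yield parents, reach, score
```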

Experiments

• Alarm Network: 37 nodes

• Bottleneck: huge tables computed by dynamic programming
• SharpLib for dynamic programming
• Dynamic Lower Bound approach (LB)
• Ongoing improvements

[Figure 2 of the DISCML’11 paper: pseudo-code for the functions MaxScore and UpperBound; a reconstruction follows below.]

Characteristics of the considered super-structures and the effect of the dynamic lower bound:

super-structure   |V(S_f)|  |E(S_f)|  Δ(S_f)  tw(S_f)   time no LB   time LB   memory no LB   memory LB
S_SKEL               37        46        6        3          14 s        7 s        337 MB       180 MB
S_TIL(0.01)          37        62        7        5       20785 s     6393 s      15679 MB      5346 MB
S_TIL(0.05)          37        63        7        5       46560 s    16712 s      38525 MB     18786 MB
S_TIL(0.1)           37        65        7        5       44554 s    16928 s      38520 MB     18918 MB

• S_SKEL is the skeleton of the Alarm network; S_TIL(α) is a super-structure learned with the tiling algorithm for significance level α ∈ {0.01, 0.05, 0.1}, with the in-degree of the resulting BN bounded by 2 (a weak restriction, since the Alarm network has only 3 nodes of in-degree greater than 2).
• The dynamic lower bound approach reduces both the running time and the memory usage by about a half or more.
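The pseudo-code of Figure 2 is garbled in this copy; here is a hedged Python reconstruction of what the two functions compute, writing U[t] for the node set the paper associates with decomposition node t:

```python
import math

def max_score(n, F, f):
    """Best local score of n avoiding F: the maximum of f(n, P) over the
    listed parent sets P of n with P ∩ F = ∅."""
    ms = -math.inf
    for (m, P), value in f.items():
        if m == n and not (P & F) and ms <= value:
            ms = value
    return ms

def upper_bound(t, U, f):
    """ub(t): an optimistic bound that lets every node in U[t] pick its
    best listed parent set independently."""
    return sum(max_score(n, frozenset(), f) for n in U[t])
```

Records whose upper bound falls below the best lower bound seen so far can be discarded, which is where the savings in the table come from.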

2. BN Structure Learning: parameterized local search

Local Search

• Once we have a candidate BN, we can try to find a better one by local editing.
• We consider three local editing operations:
  ADD - add an arc
  DEL - delete an arc
  REV - reverse an arc
• We parameterize by the number k of edits we can make in one step (a brute-force sketch of one step follows below).
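A brute-force sketch of one k-edit step (illustrative Python, not from the papers); the neighborhood has size n^{O(k)}, which is exactly why the parameterized complexity of a single step is interesting:

```python
from itertools import combinations

# Enumerate all candidate arc sets reachable by at most k edits; the
# caller keeps the best neighbour that is acyclic and improves the score.
def k_edit_neighbours(arcs, nodes, k):
    edits = ([("DEL", a) for a in arcs] +
             [("REV", a) for a in arcs] +
             [("ADD", (u, v)) for u in nodes for v in nodes
              if u != v and (u, v) not in arcs])
    for r in range(1, k + 1):
        for chosen in combinations(edits, r):
            new = set(arcs)
            for op, (u, v) in chosen:
                if op == "ADD":
                    new.add((u, v))
                elif op == "DEL":
                    new.discard((u, v))
                else:  # REV
                    new.discard((u, v))
                    new.add((v, u))
            yield frozenset(new)
```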

k-Local Search

• Parameterizing LS problems by the number of changes in one step was suggested by Fellows 2003.
• Since then a series of papers on parameterized local search has been published, considering, among others: TSP, Max-SAT, Max-Cut, Vertex Cover, Weighted SAT, Traversals, Stable Matchings, etc.

Results

Theorem: Except for {ADD} and {DEL}, all problems are W[1]-hard. Hardness even holds for instances of bounded degree.

operations      complexity
ADD             PTIME
DEL             PTIME
REV             W[1]-hard
ADD,DEL         W[1]-hard
ADD,REV         W[1]-hard
DEL,REV         W[1]-hard
ADD,DEL,REV     W[1]-hard

Does it help to bound the degree?

W[1]-hardness for bounded degree

• All known parameterized LS problems are FPT for instances of bounded degree.
• Almost all W[1]-hard problems are FPT for graphs of bounded degree.
• Red/Blue Nonblocker isn’t an exception.
• We have now a new proof by reduction from Independent Set (we don’t need the reduced graph to have bounded degree).

Reduction from IS for {REV}

Independent Set → BN local search

[Figure: an Independent Set instance on vertices u, v, w is mapped to a BN local-search instance over nodes u, v, w and primed copies u’, v’, w’.]

Degree reduction:

[Figure: a vertex u with neighbors w1, w2, w3 is split to keep the degree of the construction bounded.]

3. BN Structure Learning parameterized by distance from being a branching

Beyond Branchings

• Theorem: An optimal branching can be found in polynomial time. [Chu and Liu 1965, Edmonds 1967]
• Theorem: Finding an optimal polytree is NP-hard. [Dasgupta 1999]
• k-branching = a polytree such that after deleting certain k arcs, every node has at most one parent.
• Remark: k-branchings are in between branchings and polytrees.

[Figure: a polytree, a branching, and a 2-branching over nodes 1, ..., 7, shown on the DISCML’11 poster “Towards Finding Optimal Polytrees”.]

XP-Result

Theorem: Finding an optimal k-branching is polynomial for every constant k (i.e., the problem is in XP for parameter k).

Approach: (1) guess the deleted arcs; (2) solve the induced problem for branchings. Step (2) turns out to be quite tricky!

A k-branching (k=4)

A k-branching D decomposes as D = X ∪ X’ ∪ R:
• X = deletion set, with |X| ≤ k
• X’ = co-set, with |X’| ≤ k
• R = rest
R ∪ X’ = D − X is a polytree with in-degree ≤ 1. Hence for constant k we can try all X and X’ in polynomial time; for each choice of X ∪ X’ we compute the best set R such that R ∪ X’ is a polytree with in-degree ≤ 1 (guess phase sketched below).
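A sketch of the guess phase (illustrative; candidate_arcs would be the arcs supported by the nonzero tuples of f). For constant k there are polynomially many guesses, and each one is completed to the best R by the matroid machinery on the next slides:

```python
from itertools import combinations

# Enumerate all pairs (X, X') of disjoint arc sets with |X| <= k, |X'| <= k.
def guesses(candidate_arcs, k):
    arcs = list(candidate_arcs)
    for i in range(k + 1):
        for X in combinations(arcs, i):
            rest = [a for a in arcs if a not in X]
            for j in range(k + 1):
                for Xp in combinations(rest, j):
                    yield frozenset(X), frozenset(Xp)
```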

Matroid Approach

• Let X, X’ be two disjoint sets of arcs.
• We define two matroids over N×N.
• The in-degree matroid, where a set R is independent iff R ∩ (X ∪ X’) = ∅ and R has in-degree ≤ 1.
• The acyclicity matroid, where a set R is independent iff R ∪ X ∪ X’ has no undirected cycles.
• Lemma: D is a k-branching iff D − X is independent in both matroids.
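A minimal sketch of the two independence oracles, following the slide’s formulation (my own encoding; arcs are pairs (u, v) meaning u → v, and Xp stands for X’):

```python
def indep_indegree(R, X, Xp):
    """In-degree matroid: R avoids X ∪ X' and no node has two parents in R."""
    if R & (X | Xp):
        return False
    heads = [v for (_, v) in R]
    return len(heads) == len(set(heads))

def indep_acyclic(R, X, Xp):
    """Acyclicity matroid: the undirected graph of R ∪ X ∪ X' is acyclic.
    Checked with union-find: an edge whose endpoints already share a root
    would close a cycle."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (u, v) in R | X | Xp:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True
```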

Matroid Intersection

• Hence, for each guessed set X we need to find an optimal set D that is independent in both matroids.
• Optimality can be expressed by considering weighted matroids: w(u,v) = f(v,{u}) − f(v,∅).
• A solution can be found by means of a polynomial-time algorithm for weighted matroid intersection (Brezovec, Cornuéjols, Glover 1986).
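The slide’s weight function, assuming the explicit-listing dict representation used throughout these sketches:

```python
# Gain of making u the sole parent of v, relative to v having no parents.
def weight(u, v, f):
    return f.get((v, frozenset({u})), 0.0) - f.get((v, frozenset()), 0.0)
```

A heaviest common independent set of the two matroids under w then yields the best completion R for the current guess (X, X’).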

FPT?

• We don’t know whether the k-branching problem is FPT.
• We have FPT algorithms for special cases whenever the guess phase is FPT.
• For instance: the arcs in X ∪ X’ form an in-tree and each node has a bounded number of potential parent sets.
• BN structure learning remains NP-hard under this restriction.

A slight generalization is W[1]-hard

Define a k-node branching as a polytree that becomes a branching by the deletion of k nodes.

Theorem: Finding an optimal k-node branching is W[1]-hard.

Proof: reduction from Multicolored Clique.

Conclusion

• BN Structure Learning: important and notoriously difficult
• Parameterized complexity approach:
  • treewidth of the super structure
  • parameterized local search
  • k-branchings and k-node branchings
• Other parameters?

References

• S. Ordyniak, S. Szeider. Algorithms and complexity results for exact Bayesian structure learning. In: P. Grünwald, P. Spirtes (Eds.), Proceedings of UAI 2010, AUAI Press, 2010.
• I. Tsamardinos, L. Brown, C. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65 (2006) 31–78.
• E. Perrier, S. Imoto, S. Miyano. Finding optimal Bayesian network given a super-structure. J. Mach. Learn. Res. 9 (2008) 2251–2286.
• Y. J. Chu, T. H. Liu. On the shortest arborescence of a directed graph. Science Sinica 14 (1965) 1396–1400.
• J. R. Edmonds. Optimum branchings. Journal of Research of the National Bureau of Standards 71B(4) (1967) 233–240.
• S. Dasgupta. Learning polytrees. In: K. B. Laskey, H. Prade (Eds.), Proceedings of UAI ’99, Morgan Kaufmann, 1999, pp. 134–141.
• C. Brezovec, G. Cornuéjols, F. Glover. Two algorithms for weighted matroid intersection. Mathematical Programming 36(1) (1986) 39–53.