EM-Training for Weighted Aligned Hypergraph Bimorphisms

Frank Drewes, Kilian Gebhardt, and Heiko Vogler
Department of Computing Science, Umeå University, S-901 87 Umeå, Sweden
Department of Computer Science, Technische Universität Dresden, D-01062 Dresden, Germany
[email protected] [email protected] [email protected]

Abstract

We develop the concept of weighted aligned hypergraph bimorphism where the weights may, in particular, represent probabilities. Such a bimorphism consists of an R≥0-weighted regular tree grammar, two hypergraph algebras that interpret the generated trees, and a family of alignments between the two interpretations. Semantically, this yields a set of bihypergraphs each consisting of two hypergraphs and an explicit alignment between them; e.g., discontinuous phrase structures and non-projective dependency structures are bihypergraphs. We present an EM-training algorithm which takes a corpus of bihypergraphs and an aligned hypergraph bimorphism as input and generates a sequence of weight assignments which converges to a local maximum or saddle point of the likelihood function of the corpus.

1 Introduction

In natural language processing alignments play an important role. For instance, in machine translation they show up as hidden information when training probabilities of dictionaries (Brown et al., 1993) or when considering pairs of input/output sentences derived by a synchronous grammar (Lewis and Stearns, 1968; Chiang, 2007; Shieber and Schabes, 1990; Nederhof and Vogler, 2012). As another example, in language models for discontinuous phrase structures and non-projective dependency structures they can be used to capture the connection between the words in a natural language sentence and the corresponding nodes of the parse tree or dependency structure of that sentence.

In (Nederhof and Vogler, 2014) the generation of discontinuous phrase structures has been formalized by the new concept of hybrid grammar. Much as in the mentioned synchronous grammars, a hybrid grammar synchronizes the derivations of nonterminals of a string grammar, e.g., a linear context-free rewriting system (LCFRS) (Vijay-Shanker et al., 1987), and of nonterminals of a tree grammar, e.g., regular tree grammar (Brainerd, 1969) or simple definite-clause programs (sDCP) (Deransart and Małuszynski, 1985). Additionally it synchronizes terminal symbols, thereby establishing an explicit alignment between the positions of the string and the nodes of the tree. We note that LCFRS/sDCP hybrid grammars can also generate non-projective dependency structures.

In this paper we focus on the task of training an LCFRS/sDCP hybrid grammar, that is, assigning probabilities to its rules given a corpus of discontinuous phrase structures or non-projective dependency structures. Since the alignments are first-class citizens, we develop our approach in the general framework of hypergraphs and hyperedge replacement (HR) (Habel, 1992). We define the concepts of bihypergraph (for short: bigraph) and aligned HR bimorphism. A bigraph consists of hypergraphs H1, λ, and H2, where λ represents the alignment between H1 and H2. A bimorphism B = (g, A1, Λ, A2) consists of a regular tree grammar g generating trees over some ranked alphabet Σ, two Σ-algebras A1 and A2 which interpret each symbol in Σ as an HR operation (thus evaluating every tree to two hypergraphs), and a Σ-indexed family Λ of alignments between the two interpretations of each σ ∈ Σ. The semantics of B is a set of bigraphs.

For instance, each discontinuous phrase structure or non-projective dependency structure can be represented as a bigraph (H1, λ, H2) where H1 and H2 correspond to the string component and the tree component, respectively. Fig. 1 shows an example of a bigraph representing a non-projective dependency structure.

Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata, pages 60–69, Berlin, Germany, August 12, 2016. © 2016 Association for Computational Linguistics

[Figure 1 appears here: (a) the sentence "A hearing is scheduled on the issue today ." with its non-projective dependency arcs; (b) the bigraph (H1, λ, H2) representing it.]
Figure 1: (a) A sentence with non-projective dependencies is represented in (b) by a bigraph (H1, λ, H2). Both hypergraphs H1 and H2 contain a distinct hyperedge (box) for each word of the sentence. H1 specifies the linear order on the words. H2 describes parent-child relationships between the words, where children form a list to whose start and end the parent has a tentacle. The alignment λ establishes a one-to-one correspondence between the (input vertices of the) hyperedges in H1 and H2.

We present each LCFRS/sDCP hybrid grammar as a particular aligned HR bimorphism; this establishes an initial algebra semantics (Goguen et al., 1977) for hybrid grammars.

The flexibility of aligned HR bimorphisms goes well beyond hybrid grammars as they generalize the synchronous HR grammars of (Jones et al., 2012), making it possible to synchronously generate two graphs connected by explicit alignment structures. Thus, they can for instance model alignments involving directed acyclic graphs like Abstract Meaning Representations (Banarescu et al., 2013) or Millstream systems (Bensch et al., 2014).

Our training algorithm takes as input an aligned HR bimorphism B = (g, A1, Λ, A2) and a corpus c of bigraphs. It is based on the dynamic programming variant (Baker, 1979; Lari and Young, 1990; Prescher, 2001) of the EM-algorithm (Dempster et al., 1977) and thus approximates a local maximum or saddle point of the likelihood function of c.

In order to calculate the significance of each rule of g for the generation of a single bigraph (H1, λ, H2) occurring in c, we proceed as usual, constructing the reduct B £ (H1, λ, H2) which generates the singleton (H1, λ, H2) via the same derivation trees as B and preserves the probabilities. We show that the complexity of constructing the reduct is polynomial in the size of g and (H1, λ, H2) if B is an LCFRS/sDCP hybrid grammar. However, as the algorithm itself is not limited to this situation, we expect it to be useful in other cases as well.

2 Preliminaries

Basic mathematical notation We denote the set of natural numbers (including 0) by N and the set N \ {0} by N+. For n ∈ N, we denote {1, . . . , n} by [n]. An alphabet A is a finite set of symbols. We denote the set of all strings over A by A*, the empty string by ε, and A* \ {ε} by A+. We denote the length of s ∈ A* by |s| and, for each i ∈ [|s|], the i-th item in s by s(i), i.e., s is identified with the function s: [|s|] → A such that s = s(1) · · · s(|s|). We denote the range {s(1), . . . , s(|s|)} of s by [s]. The powerset of a set A is denoted by P(A). The canonical extensions of a function f: A → B to f: P(A) → P(B) and to f: A* → B* are defined as usual and denoted by f as well. We denote the restriction of f: A → B to A′ ⊆ A by f|A′. For an equivalence relation ∼ on B we denote the equivalence class of b ∈ B by [b]∼ and the quotient of B modulo ∼ by B/∼. For f: A → B

we define the function f/∼: A → B/∼ by f/∼(a) = [f(a)]∼. In particular, for a string s ∈ B* we let s/∼ = [s(1)]∼ · · · [s(|s|)]∼.

Terms, regular tree grammars, and algebras A ranked alphabet is a pair (Σ, rk) where Σ is an alphabet and rk: Σ → N is a mapping associating a rank with each symbol of Σ. Often we just write Σ instead of (Σ, rk). We abbreviate rk⁻¹(k) by Σk. In the following let Σ be a ranked alphabet.

Let A be an arbitrary set. We let Σ(A) denote the set of strings {σ(a1, . . . , ak) | k ∈ N, σ ∈ Σk, a1, . . . , ak ∈ A} (where the parentheses and commas are special symbols not in Σ). The set of well-formed terms over Σ indexed by A, denoted by TΣ(A), is defined to be the smallest set T such that A ⊆ T and Σ(T) ⊆ T. We abbreviate TΣ(∅) by TΣ and write σ instead of σ() for σ ∈ Σ0.

A regular tree grammar (RTG) (Gécseg and Steinby, 1984) is a tuple g = (Ξ, Σ, ξ0, R) where Ξ is an alphabet (nonterminals), Ξ ∩ Σ = ∅, elements in Σ are called terminals, ξ0 ∈ Ξ (initial nonterminal), and R is a ranked alphabet (rules); each rule in R has the form ξ → σ(ξ1, . . . , ξk) where ξ, ξ1, . . . , ξk ∈ Ξ and σ ∈ Σk. We denote the set of all rules with left-hand side ξ by Rξ for each ξ ∈ Ξ.

Since RTGs are particular context-free grammars, the concepts of derivation relation and generated language are inherited. The language of the RTG g is the set of all well-formed terms¹ in TΣ generated by g; this language is denoted by L(g).

¹In this context we use "tree" and "term" as synonyms.

We define the Ξ-indexed family (Dgξ | ξ ∈ Ξ) of mappings Dgξ: TΣ → P(TR); for each term t ∈ TΣ, Dgξ(t) is the set of t's derivation trees in TR which start with ξ and yield t. Formally, for each ξ ∈ Ξ and σ(t1, . . . , tk) ∈ TΣ, the set Dgξ(σ(t1, . . . , tk)) contains each term ϱ(d1, . . . , dk) where ϱ = (ξ → σ(ξ1, . . . , ξk)) is in R and di ∈ Dgξi(ti) for each i ∈ [k]. We define Dg(t) = ∪ξ∈Ξ Dgξ(t) and Dg(TΣ) = ∪t∈TΣ Dgξ0(t). Finally, Dgξ0(TΣ, ξ) is the set of all ζ ∈ TR({ξ}) such that there is a ζ′ ∈ Dgξ0(TΣ) which has a subtree whose root is in Rξ, and ζ is obtained from ζ′ by replacing exactly one of these subtrees by ξ.

Example 2.1. Let Σ = Σ0 ∪ Σ2 where Σ0 = {σ2, σ4, σ5} and Σ2 = {σ1, σ3}. Let g be an RTG with initial nonterminal S and the following rules:

S → σ1(A, B)    A → σ2
B → σ3(C, D)    C → σ4    D → σ5

We observe that S ⇒g* σ1(σ2, σ3(σ4, σ5)). Let η, ζ, and ζ′ be the following trees (in order): η = (B → σ3(C, D))((C → σ4), (D → σ5)), ζ = (S → σ1(A, B))((A → σ2), B), and ζ′ = (S → σ1(A, B))((A → σ2), η). Then ζ ∈ DgS(TΣ, B) because ζ′ ∈ DgS(TΣ) and the left-hand side of the root of η is B.

A Σ-algebra is a pair A = (A, (σA | σ ∈ Σ)) where A is a set and σA is a k-ary operation on A for every k ∈ N and σ ∈ Σk. As usual, we will sometimes use the algebra to refer to its carrier set A or, conversely, denote the algebra by its carrier set (and its operations accordingly) if there is no risk of confusion. The Σ-term algebra is the Σ-algebra TΣ with σTΣ(t1, . . . , tk) = σ(t1, . . . , tk) for every k ∈ N, σ ∈ Σk, and t1, . . . , tk ∈ TΣ. For each Σ-algebra A there is exactly one Σ-homomorphism, denoted by [[.]]A, from the Σ-term algebra to A (Wechler, 1992).

Hypergraphs and hyperedge replacement In the following let Γ be a finite set of labels. A Γ-hypergraph is a tuple H = (V, E, att, lab, ports), where V is a finite set of vertices, E is a finite set of hyperedges, att: E → V* \ {ε} is the attachment of hyperedges to vertices, lab: E → Γ is the labeling of hyperedges, and ports ∈ V* is a sequence of (not necessarily distinct) ports. The set of all Γ-hypergraphs is denoted by HΓ. The vertices in V \ [ports] are also called internal vertices and denoted by int(H).

For the sake of brevity, we shall in the following simply call Γ-hypergraphs and hyperedges graphs and edges, respectively. We illustrate a graph in figures as follows (cf., e.g., graph((σ2)A) in Fig. 2a). A vertex v is illustrated by a circle, which is filled and labeled by i in case that ports(i) = v. An edge e with label γ and att(e) = v1 . . . vn is depicted as a γ-labeled rectangle with n tentacles, lines pointing to v1, . . . , vn which are annotated by 1, . . . , n. (We sometimes drop these annotations.)

If we are not interested in the particular set of labels Γ, then we also call a Γ-graph simply graph and write H instead of HΓ. In the following, we will refer to the components of a graph H by indexing them with H unless they are explicitly named. Let H and H′ be graphs. H and H′ are disjoint if VH ∩ VH′ = ∅ and EH ∩ EH′ = ∅.
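For illustration, the derivation-tree sets Dgξ(t) can be enumerated directly from their inductive definition. The following Python sketch is not from the paper; the tuple encoding of rules, terms, and derivation trees is an illustrative assumption, applied to the RTG of Example 2.1.

```python
# Sketch: derivation trees of a regular tree grammar (Example 2.1).
# Encoding (illustrative): a term t is (sigma, tuple of child terms),
# a rule is (lhs, sigma, tuple of rhs nonterminals),
# a derivation tree is (rule, tuple of child derivation trees).
from itertools import product

RULES = [
    ("S", "s1", ("A", "B")),
    ("A", "s2", ()),
    ("B", "s3", ("C", "D")),
    ("C", "s4", ()),
    ("D", "s5", ()),
]

def derivations(xi, t):
    """All derivation trees in D_g^xi(t), i.e., derivations of t from xi."""
    sigma, children = t
    result = []
    for rule in RULES:
        lhs, term, rhs = rule
        if lhs == xi and term == sigma and len(rhs) == len(children):
            # combine one derivation per child, following the definition
            for ds in product(*(derivations(x, c) for x, c in zip(rhs, children))):
                result.append((rule, ds))
    return result

# t = sigma1(sigma2, sigma3(sigma4, sigma5)):
t = ("s1", (("s2", ()), ("s3", (("s4", ()), ("s5", ())))))
assert len(derivations("S", t)) == 1   # exactly one derivation from S
assert derivations("B", t) == []       # no derivation of t starts in B
```

Since every rule set Rξ of this grammar is a singleton, each tree in L(g) has exactly one derivation tree.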

[Figure 2 appears here.] Figure 2: (a) The (Σ, Γ)-HR algebra A and (b) the evaluation of the term σ1(σ2, σ3(σ4, σ5)) in A.

Let E = EH ∩ EH′. If attH|E = attH′|E and labH|E = labH′|E, then the union of the graphs H and H′ is the graph H ∪ H′ = (VH ∪ VH′, EH ∪ EH′, attH ∪ attH′, labH ∪ labH′, portsH portsH′). For every F ⊆ EH let H \ F = (VH, E, attH|E, labH|E, portsH) where E = EH \ F.

Let k ∈ N and H, H1, . . . , Hk ∈ H be pairwise disjoint. Let e1, . . . , ek ∈ EH be pairwise distinct edges, called variables. Let I be the graph H \ {e1, . . . , ek} ∪ H1 ∪ · · · ∪ Hk. The hyperedge replacement (HR) of e1 by H1, . . . , and ek by Hk in H (Bauderon and Courcelle, 1987; Habel and Kreowski, 1987) yields the graph

H[e1/H1, . . . , ek/Hk] = (VI/∼, EI, attI/∼, labI, portsH/∼)

where ∼ is the least equivalence relation on VI such that attH(ei)(j) ∼ portsHi(j) for every i ∈ [k] and j ∈ [min(|attH(ei)|, |portsHi|)]. In the following, we call ∼ the equivalence relation involved in the HR that yields H[e1/H1, . . . , ek/Hk]. We assume that each variable ei is labeled by a distinguished symbol ⊥ ∈ Σ and depict ei by a box labeled ei instead of ⊥.

Throughout this paper, we will not distinguish between isomorphic graphs, i.e., graphs that are identical up to a bijective renaming of vertices and edges. However, since hyperedge replacement is defined on concrete graphs and requires that H, H1, . . . , Hk are pairwise disjoint, we may choose isomorphic copies of the involved graphs, i.e., rename edges or vertices. To avoid the cumbersome conversion between abstract and concrete graphs, we assume that this renaming is opaque. In this sense, we may define an HR operation as a total function from H^k to H as follows. Let H be a graph. For pairwise distinct edges e1, . . . , ek ∈ EH, the HR operation H⟨e1 . . . ek⟩: H^k → H is given by

H⟨e1 . . . ek⟩(H1, . . . , Hk) = H[e1/H1, . . . , ek/Hk]

for all graphs H1, . . . , Hk ∈ H.

A (Σ, Γ)-HR algebra (Courcelle, 1991) is a Σ-algebra A = (HΓ, (σA)σ∈Σ) where, for every k ∈ N and σ ∈ Σk, we have σA = H⟨e1 . . . ek⟩ for some H ∈ HΓ and pairwise distinct e1, . . . , ek ∈ EH. Then we denote H by graph(σA). An HR algebra is a (Σ, Γ)-HR algebra for some Σ and Γ. An example of a (Σ, Γ)-HR algebra A and the application of [[.]]A to a term are given in Fig. 2.

3 Bigraphs and Aligned HR Bimorphisms

Now we formally introduce our central notions of bigraph and aligned HR bimorphism. A case study with examples follows in the next section. Throughout this section let ∆ and Ω be alphabets.

Definition 3.1. A bigraph of type (∆, Ω) is a triple B = (H1, λ, H2) where H1 ∈ H∆ and H2 ∈ HΩ are disjoint, and λ is an alignment of H1 and H2, i.e., a graph with Vλ = VH1 ∪ VH2, Eλ ∩ EH1 = Eλ ∩ EH2 = ∅, att(e) ∈ (VH1)+ · (VH2)+ for each e ∈ Eλ, and portsλ = ε.
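The effect of a single hyperedge replacement H[e/H1] can be sketched in code. The following Python fragment is an illustrative assumption, not from the paper: it fuses vertices by mapping each port of the replacing graph to the corresponding attachment vertex of the variable edge, which realizes the equivalence relation ∼ when those ports are pairwise distinct.

```python
# Sketch of hyperedge replacement H[e/H1] on concrete graphs.
# Minimal data model (illustrative): a graph is a vertex set, a dict of
# edges e -> (label, attachment tuple), and a port tuple.

class Graph:
    def __init__(self, vertices, edges, ports):
        self.vertices = set(vertices)   # vertex names
        self.edges = dict(edges)        # e -> (label, (v1, ..., vn))
        self.ports = tuple(ports)

def replace(H, e, H1):
    """Replace variable edge e of H by H1, fusing att_H(e)(j) with
    ports_H1(j).  Assumes H and H1 are disjoint (rename beforehand if
    necessary) and that the ports of H1 are pairwise distinct."""
    _, att = H.edges[e]
    k = min(len(att), len(H1.ports))
    fuse = {H1.ports[j]: att[j] for j in range(k)}   # representatives of ~
    rep = lambda v: fuse.get(v, v)
    vertices = H.vertices | {rep(v) for v in H1.vertices}
    edges = {f: (lab, a) for f, (lab, a) in H.edges.items() if f != e}
    edges.update({f: (lab, tuple(rep(v) for v in a))
                  for f, (lab, a) in H1.edges.items()})
    return Graph(vertices, edges, H.ports)

# A 2-port host graph with one variable edge e, replaced by a 2-port
# graph carrying a single 'a'-labeled edge:
H  = Graph({"u", "w"}, {"e": ("_", ("u", "w"))}, ("u", "w"))
H1 = Graph({"x", "y"}, {"f": ("a", ("x", "y"))}, ("x", "y"))
G = replace(H, "e", H1)
assert G.edges == {"f": ("a", ("u", "w"))}
```

Repeated application of `replace` for e1, . . . , ek gives the k-ary HR operation H⟨e1 . . . ek⟩ under the same assumptions.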


Figure 3: The graphs (a) HIOσ, (b) Hσ1, (c) ⟦s1(0)⟧, and (d) ⟦s1(1)⟧. The large circles and the dotted lines in (a) and (b) visualize the underlying term structure; e.g., in (b) σ1 has two children because σ1 ∈ Σ2.

Definition 3.2. An aligned HR bimorphism of type (Σ, ∆, Ω) is a tuple B = (g, A1, Λ, A2), where g is an RTG over Σ and A1, A2 are a (Σ, ∆)-HR algebra and a (Σ, Ω)-HR algebra, resp., such that graph(σA1) and graph(σA2) are disjoint for each σ ∈ Σ. Further, Λ is a Σ-indexed family (Λσ | σ ∈ Σ), each Λσ being an alignment of graph(σA1) and graph(σA2).

In the following, for each term t ∈ TΣ, we assume (w.l.o.g.) that [[t]]A1 and [[t]]A2 are disjoint.

Definition 3.3. Let B = (g, A1, Λ, A2) be an aligned HR bimorphism. The semantics of B, denoted by L(B), is the set of bigraphs defined as follows. First, let the B-alignment be the TΣ-indexed family ΛB = (ΛB(t) | t ∈ TΣ) where ΛB(t) is the alignment of [[t]]A1 and [[t]]A2 defined inductively on t as follows. Let t = σ(t1, . . . , tk) ∈ TΣ. For j ∈ [2], suppose that ∼j is the equivalence relation involved in the hyperedge replacement that yields graph(σAj)[e1/[[t1]]Aj, . . . , ek/[[tk]]Aj]. We assume that Λσ, ΛB(t1), . . . , ΛB(tk) have pairwise disjoint sets of edges. Then we define the B-alignment of [[t]]A1 and [[t]]A2 to be the graph

ΛB(t) = (Λσ ∪ ΛB(t1) ∪ · · · ∪ ΛB(tk))/(∼1 ∪ ∼2)

and we let [[t]]B = ([[t]]A1, ΛB(t), [[t]]A2). Finally, we define L(B) = {[[t]]B | t ∈ L(g)}.

4 Case Study: Hybrid Grammars

We show how an LCFRS/sDCP hybrid grammar (Nederhof and Vogler, 2014) can be represented as an aligned HR bimorphism. These grammars deal with sequence terms; hence, we first recall their definition and show how to view sequence terms as particular graphs.

4.1 A Graph View on Sequence Terms

Let Γ be a ranked alphabet and Y be a set disjoint from Γ. The sets of terms and sequence-terms (s-terms) over Γ indexed by Y (Seki and Kato, 2008) are denoted by TΓ(Y) and TΓ*(Y), respectively, and defined inductively as follows:
1. Y ⊆ TΓ(Y),
2. if k ∈ N, γ ∈ Γk and si ∈ TΓ*(Y) for each i ∈ [k], then γ(s1, . . . , sk) ∈ TΓ(Y), and
3. if n ∈ N and ti ∈ TΓ(Y) for each i ∈ [n], then ⟨t1, . . . , tn⟩ ∈ TΓ*(Y).

Let s ∈ TΓ*(Y). We say that s is linear if every y ∈ Y occurs at most once in s. In the following we only consider linear s-terms. We note that, if Γ = Γ0, then s is essentially a string over Γ0 and Y. If Γ = Γ1, then s corresponds to a sequence of ordinary (unranked) terms over Γ1 indexed by Y.

Every linear s-term s can be represented as a graph ⟦s⟧: it has two distinct ports inp and out, representing the start and end of s, resp. For each variable y ∈ Y, ⟦s⟧ has two distinct ports y^inp and y^out. For each occurrence of a symbol γ ∈ Γk in s, there is a γ-labeled edge with 2k + 2 tentacles in ⟦s⟧. The (2i−1)-th and 2i-th tentacle (i ∈ [k]) point to the start and end vertex, respectively, of the i-th child sequence of γ. The last two tentacles point towards the predecessor and the successor of γ, respectively: this may be a vertex separating two symbols, the start or end vertex of a (sub-)sequence, or the port realizing y^inp or y^out for some y ∈ Y. For instance, the s-term

s1(0) = ⟨is(⟨x1(1)⟩, ⟨x1(2)⟩), .(⟨⟩)⟩

in TΓ*({x1(1), x1(2), x2(2)}) is represented by ⟦s1(0)⟧ in Fig. 3c. The ports (x2(2))^inp and (x2(2))^out for x2(2) are depicted as filled circles to the left and the right of x2(2), and similarly for the other variables. Note that (x1(1))^out and (x1(2))^inp coincide because x1(1) is succeeded by x1(2) in s1(0).

4.2 LCFRS, sDCP, and LCFRS/sDCP Hybrid Grammars

Here we formalize LCFRS/sDCP hybrid grammars as particular aligned HR bimorphisms, where the algebras A1 and A2 are an LCFRS algebra and an sDCP algebra, resp. Since both LCFRS and sDCP can be viewed as particular types of attribute grammars (AG), we first define the concept of AG algebra and, in a second step, instantiate it to LCFRS algebra and sDCP algebra.

For each σ ∈ Σk, let synAσ = (n0, . . . , nk) ∈ N^(k+1) and inhAσ = (m0, . . . , mk) ∈ N^(k+1) be tuples defining the sets I = {yj(0) | j ∈ [n0]} ∪ {yj(i) | i ∈ [k], j ∈ [mi]} and O = {xr(0) | r ∈ [m0]} ∪ {xr(ℓ) | ℓ ∈ [k], r ∈ [nℓ]} of inside and outside attributes, resp. (The abbreviations stem from the AG notions synthesized attributes and inherited attributes.) The definition of an AG algebra follows the two-phase approach in (Engelfriet and Heyker, 1992). In the first phase, for each symbol σ in Σk, we define a graph HIOσ of type (synAσ, inhAσ) as shown in Fig. 3a: there is a pair of vertices for each inside attribute yj(i) and each outside attribute xr(ℓ). For each inside attribute yj(i) there is an edge ej(i) which connects yj(i) with all outside attributes. For the edge e(0) the tentacles are shown completely in Fig. 3a; for the other edges the tentacles to outside attributes are abridged for clarity. The edges e1, . . . , ek correspond to the k successors of σ.

In the second phase, we choose an I-indexed family of s-terms (sj(i) ∈ TΓ*(O) | yj(i) ∈ I) such that each xr(ℓ) in O occurs exactly once in all sj(i) together (single syntactic use restriction). Then we replace each edge ej(i) by the graph ⟦sj(i)⟧; this specifies a particular information flow. Formally, we set σA = Hσ⟨e1 . . . ek⟩ where

Hσ = HIOσ[ej(i)/⟦sj(i)⟧ | yj(i) ∈ I].

For instance, for σ1 ∈ Σ2 we have synAσ1 = (1, 1, 2) and inhAσ1 = (0, 1, 0), and so we construct HIOσ1 accordingly. Next, we choose s1(0) and s1(1) as depicted in Fig. 3c and 3d, respectively. Then Hσ1 = HIOσ1[e1(0)/⟦s1(0)⟧, e1(1)/⟦s1(1)⟧] is the graph in Fig. 3b where dashed lines indicate the identification of vertices. Note that Hσ1 equals graph((σ1)A) in Fig. 2a.

A (Σ, Γ)-HR algebra A is a (Σ, Γ)-attribute grammar algebra ((Σ, Γ)-AG algebra) if each symbol σ in Σ is interpreted as described above. For instance, A of Fig. 2a is a (Σ, Γ)-AG algebra. We observe that (Σ, Γ)-AG algebras have the following property: for every edge e ∈ EHσ, if labHσ(e) ∈ Γk then e has 2k + 2 tentacles. We call the vertex attHσ(e)(k + 1) the input vertex of e and denote it by inp(e). Note that no two terminal edges in Hσ have the same input vertex. This single-input property will be crucial for an efficient representation of subgraphs during the reduct construction (cf. Sec. 5.2).

Next we instantiate the concept of AG algebra to LCFRS algebras and to sDCP algebras. An LCFRS does not have inherited attributes:

Definition 4.1. Let ∆ = ∆0 be a ranked alphabet. A (Σ, ∆)-AG algebra A is a (Σ, ∆)-LCFRS algebra if inhAσ = (0, . . . , 0) for all σ ∈ Σ.

Definition 4.2. Let Ω = Ω1 be a ranked alphabet and let A be a (Σ, Ω)-AG algebra. We say that A is a (Σ, Ω)-sDCP algebra.

Then the graph view on an LCFRS/sDCP hybrid grammar is an HR bimorphism B = (g, A1, Λ, A2), where
• g = (Ξ, Σ, ξ0, R) is an RTG,
• A1 is a (Σ, ∆)-LCFRS algebra,
• A2 is a (Σ, Ω)-sDCP algebra, ∆ = Ω (regarding ∆ and Ω as sets of symbols), and
• there are functions fan: Ξ → N+, inh: Ξ → N, and syn: Ξ → N such that fan(ξ0) = 1, inh(ξ0) = 0, syn(ξ0) = 1, and for every (ξ → σ(ξ1, . . . , ξk)) ∈ R, it holds that
  – (fan(ξ), fan(ξ1), . . . , fan(ξk)) = synA1σ and
  – (inh(ξ), inh(ξ1), . . . , inh(ξk)) = inhA2σ and (syn(ξ), syn(ξ1), . . . , syn(ξk)) = synA2σ.

Moreover, we require the following: Let σ ∈ Σ and Hj = graph(σAj) for j ∈ [2]. For each e ∈ EΛσ we have attΛσ(e) = inp(e1) inp(e2) where e1 ∈ EH1, e2 ∈ EH2, and labH1(e1) = labH2(e2).

Example 4.3. Let g be as in Ex. 2.1 and consider the LCFRS/sDCP hybrid grammar B = (g, A1, Λ, A2), where A1, Λ, and A2 are as specified in Fig. 4. Then the bigraph in Fig. 1b equals [[σ1(σ2, σ3(σ4, σ5))]]B and is thus in L(B).
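The s-terms of Sec. 4.1 are easy to manipulate programmatically. The following Python sketch (the tuple encoding and function names are illustrative assumptions, not from the paper) represents s-terms as nested tuples and checks the linearity condition.

```python
# Sketch: sequence terms (s-terms) as nested Python tuples.
# Encoding (illustrative): a term is a variable string or
# (gamma, [child s-term, ...]); an s-term is a tuple of terms,
# so <> is () and a rank-k symbol carries k child s-terms.

def variables(s_term):
    """All variable occurrences in an s-term, in left-to-right order."""
    occ = []
    for t in s_term:
        if isinstance(t, str):           # a variable y in Y
            occ.append(t)
        else:                            # (gamma, child s-terms)
            _, children = t
            for child in children:
                occ.extend(variables(child))
    return occ

def is_linear(s_term):
    """True iff every variable occurs at most once."""
    occ = variables(s_term)
    return len(occ) == len(set(occ))

# s1^(0) = < is(<x1^(1)>, <x1^(2)>), .(<>) >  from Sec. 4.1:
s1_0 = (("is", [("x1_1",), ("x1_2",)]), (".", [()]))
assert is_linear(s1_0)
assert variables(s1_0) == ["x1_1", "x1_2"]
```

A check of the single syntactic use restriction amounts to the same count over all chosen s-terms of a symbol together.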

[Figure 4 appears here.] Figure 4: The interpretation of σ1, . . . , σ5 in A1, Λ, and A2.

5 EM Training

We present a training algorithm which takes as input a weighted aligned HR bimorphism and a finite, non-empty corpus c of bigraphs. It is essentially the same as the training algorithm for probabilistic context-free grammars (PCFG) (Baker, 1979; Lari and Young, 1990; Nederhof and Satta, 2008). As shown in (Prescher, 2001), this algorithm is a dynamic programming variant of the EM-algorithm (Dempster et al., 1977). Thus, our algorithm generates a sequence of probability assignments which converges to a probability assignment p̂; the likelihood of c under p̂ is a local maximum or saddle point of the likelihood function of c.

5.1 Weighted Aligned HR Bimorphisms

We define weighted RTGs in a similar way as PCFGs were defined in (Nederhof and Satta, 2006). A weighted regular tree grammar (WRTG) is a pair (g, p) where g = (Ξ, Σ, ξ0, R) is an RTG and p: R → R≥0 is the weight assignment. A weight assignment p is a probability assignment if Σ_{ρ∈Rξ} p(ρ) = 1 for each ξ ∈ Ξ. We extend p to the mapping p′: Dg(TΣ) → R≥0 on derivations as follows: for each d = ϱ(d1, . . . , dk) in Dg(TΣ) we define p′(d) = p(ϱ) · Π_{i=1}^{k} p′(di). For each t ∈ TΣ we define p′′(t) = Σ_{d∈Dg(t)} p′(d). We define the mappings in: Ξ → R≥0 ∪ {∞} (inside weight) and out: Ξ → R≥0 ∪ {∞} (outside weight) for each ξ ∈ Ξ by

in(ξ) = Σ_{d ∈ Dgξ(TΣ)} p′(d)    out(ξ) = Σ_{d ∈ Dgξ0(TΣ, ξ)} p′′′(d)

where p′′′: Dgξ0(TΣ, ξ) → R≥0 is defined in the same way as p′, with the addition that p′′′(ξ) = 1. As usual, we will drop the primes from p′, p′′, and p′′′.

Definition 5.1. A weighted aligned HR bimorphism is a pair (B, p) = ((g, A1, Λ, A2), p) where (g, A1, Λ, A2) is an aligned HR bimorphism and (g, p) is a WRTG.

5.2 Reduct Construction

Given a weighted aligned HR bimorphism (B, p) = ((g, A1, Λ, A2), p) and a bigraph (H1, λ, H2), we restrict g to an RTG g′ such that only trees t ∈ L(g) satisfying [[t]]B = (H1, λ, H2) are in L(g′). Also, we show that if B is an LCFRS/sDCP hybrid grammar, then g′ can be constructed in time polynomial in the size of B and (H1, λ, H2).

Definition 5.2. Let m ∈ N and H ∈ H. A graph H′ is an m-subgraph of H if int(H′) ⊆ int(H), EH′ ⊆ EH, labH′ = labH|EH′, and |portsH′| ≤ m. Moreover, we require that there is a mapping ϕ: VH′ → VH, called vertex mapping, such that ϕ(attH′(e)) = attH(e) for each e ∈ EH′, ϕ(v) = v for each v ∈ int(H′), and ϕ(int(H′)) ∩ ϕ([portsH′]) = ∅. Moreover, for each v ∈ int(H′), if v ∈ [attH(e)] for some e ∈ EH, then e ∈ EH′. The set of all m-subgraphs of H is denoted by HS^m(H).

For instance, graph((σ4)A) in Fig. 2a is a 2-subgraph of the last graph in Fig. 2b. If a graph H is the result of applying an HR operation to graphs H1, . . . , Hk, then each Hi is a |portsHi|-subgraph of H. (For this, the mapping ϕ in Definition 5.2 is
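The inside weights of a WRTG form the least solution of the equation system in(ξ) = Σ over rules ξ → σ(ξ1, . . . , ξk) of p(ϱ) · Π_j in(ξj), which can be approximated by fixed-point iteration. The following Python sketch (names are illustrative, not from the paper) does this for the grammar of Example 2.1 under the only possible probability assignment.

```python
# Sketch: inside weights of a WRTG by fixed-point iteration, a standard
# approximation of the least solution of the inside equations.

def inside_weights(rules, nonterminals, iterations=100):
    """rules: list of (lhs, weight, rhs-tuple).  Iterates
    in(xi) = sum over rules xi -> sigma(xi_1..xi_k) of p * prod in(xi_j),
    starting from in(xi) = 0 for all xi."""
    inw = {x: 0.0 for x in nonterminals}
    for _ in range(iterations):
        new = {x: 0.0 for x in nonterminals}
        for lhs, p, rhs in rules:
            prod = p
            for x in rhs:
                prod *= inw[x]
            new[lhs] += prod
        inw = new
    return inw

# Every rule set R_xi of Example 2.1 is a singleton, so each rule has
# probability 1 and every inside weight converges to 1:
rules = [("S", 1.0, ("A", "B")), ("A", 1.0, ()),
         ("B", 1.0, ("C", "D")), ("C", 1.0, ()), ("D", 1.0, ())]
inw = inside_weights(rules, {"S", "A", "B", "C", "D"})
assert inw["S"] == 1.0
```

The outside weights satisfy an analogous system and can be approximated the same way; for cyclic grammars the iteration only approaches the least solution, which is why the paper allows the value ∞.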

needed, because some of the ports of Hi may be identified with each other in H.) Hence, for the reduct we consider only m-subgraphs of H, where m is the maximal port length of HR operations in A1 or A2. We observe that HS^m(H) is finite because we identify isomorphic graphs.

Definition 5.3. Let (B, p) = ((g, A1, Λ, A2), p) be a weighted aligned HR bimorphism with g = (Ξ, Σ, ξ0, R) and let (H1, λ, H2) be a bigraph. We define (B, p) £ (H1, λ, H2), the reduct of (B, p) with respect to (H1, λ, H2), to be the weighted aligned HR bimorphism ((g′, A1, Λ, A2), p′) where g′ and p′ are defined as follows. If [[.]]B⁻¹(H1, λ, H2) = ∅, then g′ = ({ξ0}, Σ, ξ0, ∅) and p′ = ∅. Otherwise, let m ∈ N be the maximum of all |ports_graph(σA1)| and |ports_graph(σA2)| where σ ∈ Σ. Now, we construct g′ = (Ξ′, Σ, ξ0′, R′) where we abbreviate HS^m(H1) × HS^m(λ) × HS^m(H2) by HS^m(H1, λ, H2):
• Ξ′ = Ξ × (HS^m(H1, λ, H2) ∩ [[TΣ]]B),
• ξ0′ = (ξ0, H1, λ, H2), and
• for every rule ϱ = (ξ → σ(ξ1, . . . , ξk)) ∈ R and (s, η, r), (s1, η1, r1), . . . , (sk, ηk, rk) in HS^m(H1, λ, H2) ∩ [[TΣ]]B we have

ϱ′ = ((ξ, s, η, r) → σ((ξ1, s1, η1, r1), . . . , (ξk, sk, ηk, rk))) ∈ R′

if s = σA1(s1, . . . , sk), η = Λσ ∪ η1 ∪ · · · ∪ ηk, and r = σA2(r1, . . . , rk). We define p′(ϱ′) = p(ϱ).

Theorem 5.4. In Def. 5.3 the following hold:
1. L(g′) = [[.]]B⁻¹(H1, λ, H2) ∩ L(g).
2. There is a deterministic tree relabeling φ from Dg′ to Dg such that for all t ∈ L(g′) and d′ ∈ Dg′(t), φ|Dg′(t) is a bijection between Dg′(t) and Dg(t), and p′(d′) = p(φ(d′)).

Proof. If [[.]]B⁻¹(H1, λ, H2) = ∅, then L(g′) = ∅ by construction, and thus, both statements of the theorem hold. Otherwise, the first statement follows from the following claim.

Claim (*) For every n ∈ N, ξ ∈ Ξ, t ∈ TΣ, and (s, η, r) ∈ HS^m(H1, λ, H2) ∩ [[TΣ]]B it holds that (ξ, s, η, r) ⇒g′^n t iff ξ ⇒g^n t and [[t]]A1 = s and ΛB(t) = η and [[t]]A2 = r.

For the proof of the second statement we define φ((ξ, s, η, r)) = ξ for each (ξ, s, η, r) ∈ Ξ′, and extend φ in the canonical way to derivation trees. Then the statement is an immediate consequence of the constructions of R′ and φ and Claim (*).

Complexity We determine the complexity of the reduct construction for the special case of LCFRS/sDCP hybrid grammars. We assume that the maximal length m of ports in the HR operations is fixed and not part of the input. In preparation, we determine an upper bound on |Ξ′|.

Let H ∈ H be such that from each vertex there is an (undirected) path to a port. Given H, each H′ ∈ HS^m(H) ∩ [[TΣ]]A is uniquely determined by its boundary representation (Lautemann, 1990; Chiang et al., 2013; Groschwitz et al., 2015), which consists of (a) the pair (portsH′, ϕ(portsH′)), (b) the set of boundary edges E^B_H′ of H′ consisting of all e ∈ EH′ such that [attH′(e)] ∩ [portsH′] ≠ ∅, and (c) a function att′: E^B_H′ → ([portsH′] ∪ {⊥})* such that att′(e)(i) = attH′(e)(i) if attH′(e)(i) ∈ [portsH′], and ⊥ otherwise. Note that (c) is necessary because ϕ might not be injective.

Now, for an arbitrary (Σ, Γ)-AG algebra A and H ∈ ⟦TΓ*(∅)⟧ the following holds. Due to the single-input property of H and of the involved HR operations, for each m-subgraph H′ the information (b) and (c) can be inferred from (a). There are Sm,k port sequences of length m with k distinct vertices, where Sm,k is the Stirling number of the second kind. For each port we choose a vertex in VH. Thus, we obtain that

|HS^m(H) ∩ [[TΣ]]A| ≤ Σ_{k=0}^{m} Sm,k · |VH|^k ≤ m^m · |VH|^m.

Next, we analyze the B-alignments of an LCFRS/sDCP hybrid grammar B. Let (s, η, r) ∈ HS^m(H1, λ, H2) and t ∈ TΣ with [[t]]B = (s, η, r). Then Vη = Vs ∪ Vr. Let e ∈ Eλ and e1 ∈ EH1 and e2 ∈ EH2 with attλ(e) = inp(e1) inp(e2). (i) Assume that there is t′ ∈ TΣ with [[t′]]B = (H1, λ, H2) and t is a subtree of t′. Then e1 and e2 are unique by the single-input property. Hence, e1 ∈ Es iff e2 ∈ Er iff e ∈ Eη because t is a subtree of t′ and by the structure of Λσ. (ii) If there is no such t′, then the equivalences under (i) may be violated, in which case (s, η, r) can safely be pruned.

Thus, for each LCFRS/sDCP hybrid grammar B and each (H1, λ, H2) with H1 ∈ ⟦T∆*(∅)⟧ and H2 ∈ ⟦TΩ*(∅)⟧ we obtain the upper bound |Ξ| · m^(2m) · |VH1|^m · |VH2|^m on |Ξ′|. This bound can be refined to |Ξ| · m^(2m) · |VH1|^(2·fan*) · |VH2|^(2·(syn*+inh*)), where f* = max_{ξ∈Ξ} f(ξ) for f ∈ {fan, syn, inh}. Constructing Ξ′ and R′ simultaneously with a deductive parsing algorithm (Shieber et al., 1995) has

a worst-case time complexity in O(|Ξ′|^k), where k is the maximum rank of Σ.

5.3 EM Training Algorithm

In the first step of our training algorithm, a corpus c′ : R → ℝ≥0 is computed as follows. After initialization (line 3), each bigraph B occurring in c is considered (line 4), the reduct (B, p_i) ◁ B is built (line 5), the inside/outside weights of the new WRTG (g′, p′) are calculated (line 6), and according to these weights and the current weight assignment p_i the count c′(ϱ) of each rule is incremented (lines 8–9). In the second step, the corpus c′ is normalized (lines 10–14) and the result is the next probability assignment p_{i+1} (line 15).

Algorithm 5.1 EM-training algorithm for weighted aligned HR bimorphisms.
Input: a weighted aligned HR bimorphism (B, p0) = ((g, A1, Λ, A2), p0) with g = (Ξ, Σ, ξ0, R), and a finite, non-empty corpus c of bigraphs.
Output: sequence p1, p2, p3, . . . of improved probability assignments for R.
 1: i ← 0
 2: while true do
 3:   initialize counts c′ : R → ℝ≥0: c′(ϱ) ← 0
 4:   for B = (H1, λ, H2) s.t. c(B) > 0 do
 5:     ((g′, A1, Λ, A2), p′) ← (B, p_i) ◁ B with RTG g′ = (Ξ′, Σ, ξ′0, R′) and det. tree relabeling φ   // cf. Thm. 5.4
 6:     compute out and in for the WRTG (g′, p′)
 7:     if in(ξ′0) ≠ 0 then
 8:       for ϱ = (ξ → σ(ξ1, . . . , ξk)) ∈ R do
 9:         c′(ϱ) ← c′(ϱ) + c(B) · in(ξ′0)^{-1} · Σ_{ϱ′ ∈ R′ : φ(ϱ′) = ϱ ∧ ϱ′ = (ξ′ → σ(ξ′1, . . . , ξ′k))} out(ξ′) · p_i(ϱ) · Π_{j=1}^{k} in(ξ′j)
10:   for ξ ∈ Ξ do
11:     s ← Σ_{ϱ ∈ R_ξ} c′(ϱ)
12:     for ϱ ∈ R_ξ do
13:       if s = 0 then p_{i+1}(ϱ) ← p_i(ϱ) · |R_ξ|^{-1}
14:       else p_{i+1}(ϱ) ← s^{-1} · c′(ϱ)
15:   output p_{i+1} and i ← i + 1

Acknowledgment

We thank the referees for their careful reading of the manuscript.

References

James K. Baker. 1979. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages 547–550.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proc. 7th Linguistic Annotation Workshop, ACL 2013 Workshop.

Michel Bauderon and Bruno Courcelle. 1987. Graph expressions and graph rewriting. Mathematical Systems Theory, 20:83–127.

Suna Bensch, Frank Drewes, Helmut Jürgensen, and Brink van der Merwe. 2014. Graph transformation for incremental natural language analysis. Theoretical Computer Science, 531:1–25.

Walter S. Brainerd. 1969. Tree generating regular systems. Inform. and Control, 14:217–231.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.

David Chiang, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, Bevan Jones, and Kevin Knight. 2013. Parsing graphs with hyperedge replacement grammars. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics, pages 924–932.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Bruno Courcelle. 1991. The monadic second-order logic of graphs V: on closing the gap between definability and recognizability. Theoretical Computer Science, 80:153–202.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.

Pierre Deransart and Jan Małuszynski. 1985. Relating logic programs and attribute grammars. J. Logic Programming, 2:119–155.

Joost Engelfriet and Linda Heyker. 1992. Context-free hypergraph grammars have the same term-generating power as attribute grammars. Acta Informatica, 29(2):161–210.

Ferenc Gécseg and Magnus Steinby. 1984. Tree Automata. Akadémiai Kiadó, Budapest. (See also arXiv:1509.06233, 2015).

Joseph A. Goguen, James W. Thatcher, Eric G. Wagner, and Jesse B. Wright. 1977. Initial algebra semantics and continuous algebras. J. ACM, 24:68–95.

Jonas Groschwitz, Alexander Koller, and Christoph Teichmann. 2015. Graph parsing with s-graph grammars. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1481–1490.

Annegret Habel and Hans-Jörg Kreowski. 1987. May we introduce to you: Hyperedge replacement. In Proc. of the Third Intl. Workshop on Graph Grammars and Their Application to Computer Science, pages 15–26.

Annegret Habel. 1992. Hyperedge Replacement: Grammars and Languages, volume 643 of Lecture Notes in Computer Science. Springer.

Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-based machine translation with hyperedge replacement grammars. In M. Kay and C. Boitet, editors, Proc. 24th Intl. Conf. on Computational Linguistics (COLING 2012): Technical Papers, pages 1359–1376.

Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35–56.

Clemens Lautemann. 1990. The complexity of graph languages generated by hyperedge replacement. Acta Inf., 27(5):399–421.

Philip M. Lewis and Richard E. Stearns. 1968. Syntax-directed transduction. J. ACM, 15(3):465–488.

Mark-Jan Nederhof and Giorgio Satta. 2006. Probabilistic parsing strategies. J. ACM, 53(3):406–436.

Mark-Jan Nederhof and Giorgio Satta. 2008. Probabilistic parsing. In Gemma Bel-Enguix, M. Dolores Jiménez-López, and Carlos Martín-Vide, editors, New Developments in Formal Languages and Applications, volume 113 of Studies in Computational Intelligence, pages 229–258. Springer.

Mark-Jan Nederhof and Heiko Vogler. 2012. Synchronous context-free tree grammars. In Proc. of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11), pages 55–63.

Mark-Jan Nederhof and Heiko Vogler. 2014. Hybrid grammars for discontinuous parsing. In Proc. of 25th International Conference on Computational Linguistics (COLING 2014), pages 1370–1381.

Detlef Prescher. 2001. Inside-outside estimation meets dynamic EM. In Proc. of the 7th International Workshop on Parsing Technologies, pages 241–244.

Hiroyuki Seki and Yuki Kato. 2008. On the generative power of multiple context-free grammars and macro grammars. IEICE Transactions on Information and Systems, E91-D(2):209–221.

Stuart M. Shieber and Yves Schabes. 1990. Synchronous tree-adjoining grammars. In Proc. of the 13th International Conference on Computational Linguistics, volume 3, pages 253–258.

Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and implementation of deductive parsing. J. Logic Programming, 24(1–2):3–36.

Krishnamurti Vijay-Shanker, David J. Weir, and Aravind K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical formalisms. In Proc. of the 25th Annual Meeting of the Association for Computational Linguistics, pages 104–111.

Wolfgang Wechler. 1992. Universal Algebra for Computer Scientists, volume 25 of EATCS Monographs on Theoretical Computer Science. Springer.
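As a concluding illustration, the normalization step of Algorithm 5.1 (lines 10–14) can be sketched in executable form. This is our own Python sketch, not code from the paper: the data layout (plain dicts keyed by rule identifiers, with rules grouped by left-hand side nonterminal) is an assumption, and the zero-count fallback here simply keeps the old weight, a simplification of line 13.

```python
def normalize_counts(counts, p_old, rules_by_lhs):
    """One normalization pass (cf. lines 10-14 of Algorithm 5.1):
    for each left-hand side xi, the new weight of a rule rho is its
    relative frequency c'(rho) / s, where s is the total count of all
    rules with that left-hand side; if s = 0, the old weight is kept
    (a simplified fallback)."""
    p_new = {}
    for lhs, rules in rules_by_lhs.items():
        s = sum(counts.get(r, 0.0) for r in rules)
        for r in rules:
            if s == 0.0:
                p_new[r] = p_old[r]          # fallback: keep old weight
            else:
                p_new[r] = counts.get(r, 0.0) / s
    return p_new

# Hypothetical toy grammar: two rules for S observed with counts 3 and 1,
# while A's single rule was never counted in this round.
rules_by_lhs = {"S": ["S->AB", "S->a"], "A": ["A->a"]}
counts = {"S->AB": 3.0, "S->a": 1.0}
p_old = {"S->AB": 0.5, "S->a": 0.5, "A->a": 1.0}
p_new = normalize_counts(counts, p_old, rules_by_lhs)
# p_new: S->AB gets 0.75, S->a gets 0.25, A->a keeps its old weight 1.0
```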