Department of Computer Science

Generation of Uniformly-Random Graphs

Submitted in part fulfilment for the degree of BSc

Mark Henrick

29th of April 2019

Supervisor: Detlef Plump

Contents

Executive Summary
    0.1 Ethics

1 Introduction

2 Literature Review
    2.1 Preliminaries and Notation
    2.2 Generation of Strings u.a.r.
    2.3 Mairson’s Methods
        2.3.1 A Space-Time Tradeoff
        2.3.2 Example: Balanced Brackets
    2.4 Ambiguous String Grammars
    2.5 Hypergraphs and Hyperedge-Replacement Grammars
        2.5.1 Informal Overview
        2.5.2 Formal Overview

3 Adaptation of String Algorithms to HRGs
    3.1 Prior Work
    3.2 Substrings and Concatenation for Graphs
    3.3 Length and Size
    3.4 Normal Form
        3.4.1 The Modified Mairson Algorithm
    3.5 Embedding of String Grammars in HRGs

4 Implementation
    4.1 Language Choice
    4.2 User Interface
        4.2.1 Deduplication

5 Evaluation
    5.1 Grammars
        5.1.1 The Palindrome Grammar
        5.1.2 The a∗|bbb Grammars
        5.1.3 The “AB” Grammar
    5.2 Correctness
        5.2.1 Ambiguity
    5.3 Performance
        5.3.1 Methodology
        5.3.2 Results

6 Conclusion
    6.1 Opportunities for Further Work

Executive Summary

This project aims to produce a program which accepts as input an unambiguous context-free hyperedge replacement grammar and produces random hypergraphs of a given size¹. The important property is that these hypergraphs should be generated uniformly at random (u.a.r.), meaning that every hypergraph of the specified size that the grammar produces should be generated with equal probability. A primary use case of this software is to produce random inputs for testing graph algorithms, therefore the generator should write its results in a computer-readable format suitable to be used in other programs. The only similar programs found in the literature review are past BSc projects, which use slightly different algorithms or software platforms than this project.

In chapter 1 further background on the state of the art is given. Chapter 2 explores the existing string algorithms and explains hypergraphs and hyperedge replacement grammars (HRGs). Chapter 3 details the adaptation of a string algorithm by Harry Mairson [1] to HRGs. While the original algorithm required the input be in Chomsky normal form, these restrictions are partially relaxed in the process of adapting the algorithm to account for the properties of hypergraphs.

In chapter 4 the implementation is covered, including software platform choice and a discussion of how to remove many duplicate graphs from the output in an efficient manner. The algorithm is implemented in Java [2], and results are rendered in JSON [3], a widely-supported data interchange format.

The evaluation of the program is detailed in chapter 5. The program produces graphs with the expected distribution, but unfortunately is found to be generally slower than a program produced by Jake Coxon [4], while performing faster than one produced by Carla Lawrence [5]. The report is concluded in chapter 6, which details some areas for further work.

¹ The size of a hypergraph is the sum of the numbers of hyperedges and nodes.


0.1 Ethics

As this project is purely adapting and implementing a rather abstract mathematical algorithm, there are no direct ethical implications. As usual, academic integrity must be maintained, and is of heightened importance due to the existence of similar student projects in the area.

1 Introduction

Graphs are one of the most ubiquitous and versatile data structures in computer science and discrete mathematics. As they can be used to model a large number of problems, there is considerable interest in the manipulation of graphs, resulting in programming languages designed specifically for that purpose, such as GP2 [6]. A concern for all software development is testing, which can take the form of formal verification, hand-written assertions, or generating random test cases that are checked for certain invariants.

The aim of this project is to develop a method which can be used to generate graphs derived from a specific grammar uniformly at random (u.a.r.). These outputs can then be used as random inputs to a graph algorithm, allowing semi-automatic testing. Random graph generators do exist, such as Stanford GraphBase [7]; however, these are far less powerful than what is needed for this project. The existing generators primarily generate “ordinary” graphs — we will be generating hypergraphs — and rarely give much control over the “shape” of the graph. We will be generating graphs from hyperedge-replacement grammars, which allow powerful specification of graph languages.

Prior work in the area of uniform generation from grammars has primarily focused on strings, for which it is mostly a solved problem, as detailed in section 2.2; however, work extending this to graphs is limited, as detailed in section 3.1. This project aims to add to this prior work with a new hypergraph generator for the Java platform, using a variant of an existing string algorithm by Mairson (detailed in section 2.3), which is faster than a similar program written by Lawrence (see again section 3.1).

2 Literature Review

2.1 Preliminaries and Notation

This report presumes rudimentary knowledge of context-free string grammars. We will define a context-free grammar (CFG) for strings as G = (N, Σ, P, S), where N is a set of nonterminals (variables), Σ is the terminal alphabet (disjoint from N), P ⊆ N × (N ∪ Σ)∗ is the set of productions (rules) and S ∈ N is the start symbol.

L is the language generated by the grammar, and we write L_ℓ for the sublanguage restricted to a specific string length, L_ℓ = L ∩ Σ^ℓ. Note that while L may be (countably) infinite, L_ℓ is finite, with cardinality at most |Σ|^ℓ. We will write terminals as lowercase and nonterminals as capitals. ⇒ denotes direct derivation and ⇒∗ means derivation by any number of steps. e denotes the empty string, and ℓ will be used throughout to denote the length of the string, or size of the hypergraph, which we wish to generate.

2.2 Generation of Strings u.a.r.

The problem of generating strings uniformly at random (u.a.r.) from a context-free grammar (CFG) has received substantial attention. We formally specify the problem as follows: given inputs of a CFG G and a length ℓ > 0, describe an algorithm to select a string from L_ℓ with probability 1/|L_ℓ|.

Methods based on choosing available productions u.a.r. will not work, as strings with shorter derivations will be more likely to be generated. Intuitively one can think of a total language tree — this approach would only work if it were perfectly balanced. Hickey and Cohen [8] present two algorithms for unambiguous grammars. This work is improved upon by Mairson [1], who presents two algorithms that give a tradeoff: linear generation time via use of a quadratic-size data structure, versus quadratic time and linear space.


2.3 Mairson’s Methods

Mairson presumes an unambiguous grammar G that is in Chomsky normal form (CNF). This means any production is of the form A → BC (which I will call “binary productions”), A → x (which I will call “terminal productions”), or S → e (which I will call “the empty production”). There is a well-known terminating algorithm to convert any CFG to this form. Generation is considered with regard to a certain starting symbol, which may not be the “global” S of the grammar — we will refer to it as I for “initial”.

If ℓ < 2, we simply choose a random production I → x where |x| = ℓ and return x (in practice the only possibilities for ℓ = 0 are S → e or failure). This leaves the case of ℓ > 1. Mairson defines the “potential”¹ of a symbol for a given length, denoted ||A||_ℓ, as the number of strings of length ℓ that can be derived from A in any number of steps. This can be computed as the number of strings of length ℓ that can be directly derived from A (nonzero only for ℓ = 1), plus the potential of each production with an LHS of A.

The production potential ||A → BC||_ℓ is the number of strings of length ℓ with a derivation starting with A → BC. To generate such a string we have a choice of how to split the length between the string generated from B and that from C, as long as the lengths are positive and sum to ℓ. In other words,

||A → BC||_ℓ = ∑_{0 < k < ℓ} ||B||_k · ||C||_{ℓ−k}

The algorithm fails exactly when ||I||_ℓ = 0. After this check, every selection and recursive call that the algorithm makes is guaranteed to succeed. The potentials can be calculated efficiently using dynamic programming (algorithm 1).
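As a concrete illustration of this preprocessing step, the following stand-alone Java sketch fills the potential table bottom-up for a CNF grammar. The representation (terminal productions as a list of LHS labels, binary productions as triples) and all names are mine, not taken from any existing implementation; the empty production is omitted since it only matters for ℓ = 0.

```java
import java.util.*;

public class PotentialTable {

    /** Computes ||A||_i for every nonterminal A and every 1 <= i <= ell,
        following algorithm 1: seed length 1 from the terminal productions,
        then fill lengths 2..ell from the binary productions. */
    public static Map<String, long[]> potentials(
            List<String> terminalProds,   // one entry "A" per production A -> x
            List<String[]> binaryProds,   // one entry {A, B, C} per production A -> BC
            int ell) {
        Map<String, long[]> t = new HashMap<>();
        for (String a : terminalProds) t.computeIfAbsent(a, x -> new long[ell + 1]);
        for (String[] p : binaryProds)
            for (String s : p) t.computeIfAbsent(s, x -> new long[ell + 1]);
        // each terminal production contributes one string of length 1
        for (String a : terminalProds) t.get(a)[1]++;
        // ||A -> BC||_i = sum over 0 < k < i of ||B||_k * ||C||_{i-k}
        for (int i = 2; i <= ell; i++)
            for (String[] p : binaryProds)
                for (int k = 1; k < i; k++)
                    t.get(p[0])[i] += t.get(p[1])[k] * t.get(p[2])[i - k];
        return t;
    }

    public static void main(String[] args) {
        // the Chomsky-normalised Dyck grammar of section 2.3.2
        // (terminal productions: X -> b, Y -> b, A -> a, B -> b)
        List<String> term = Arrays.asList("X", "Y", "A", "B");
        List<String[]> bin = Arrays.asList(
                new String[]{"S0", "A", "X"}, new String[]{"S", "A", "X"},
                new String[]{"X", "B", "S"}, new String[]{"X", "S", "Y"},
                new String[]{"Y", "B", "S"});
        Map<String, long[]> t = potentials(term, bin, 6);
        System.out.println(Arrays.toString(t.get("S"))); // [0, 0, 1, 0, 2, 0, 5]
    }
}
```

Running it on the Chomsky-normalised Dyck grammar of section 2.3.2 reproduces the rows of table 2.1.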

2.3.1 A Space-Time Tradeoff

The preprocessing algorithm that was just discussed produces a data structure of size O(ℓ) for a constant grammar; however, the generation of a word with use of this data structure has a time complexity quadratic in the length of the string. Mairson also offers a method to produce an

¹ Terminology mine; Mairson does not give it a name.


Initialise all cells for all nonterminals to 0
foreach Terminal production A → x ∈ P do
    A[1] ← A[1] + 1
for i ← 2 to ℓ do
    foreach Binary production A → BC ∈ P do
        A[i] ← A[i] + ∑_{0 < k < i} B[k] · C[i − k]

||A||_ℓ can now be found at A[ℓ].

Algorithm 1: Mairson’s algorithm for calculation of potentials

auxiliary data structure with size quadratic in ℓ which can later be used for linear-time string generation. First we construct a grammar G′ where N′ = N × [1, ℓ], yielding new nonterminals denoted as A_1, A_2, etc. Similarly, productions A → BC are replaced with productions A_i → B_k C_{i−k} with 1 ≤ i ≤ ℓ and 0 < k < i. This means that A_ℓ ⇒∗ w iff A ⇒∗ w and |w| = ℓ.

For each binary production A → BC and i in [1, ℓ], a binary tree T_i[A → BC] is constructed where the leaves represent a choice for the split k and have weight ||B||_k · ||C||_{i−k}, and internal nodes have a weight equal to the sum of their children. The algorithm proceeds as before up to the choice of the production A → BC, then we use the tree to choose k. The tree is traversed from the root, with the next node chosen probabilistically according to its weight compared to its sibling’s. When we reach a leaf, we have our k, selected with the same probability as before, and proceed recursively as before.
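The tree walk can be sketched as follows. This is my own illustrative Java (class and method names mine), showing only the weighted tree and the root-to-leaf descent; because the probabilities multiply along the path, a leaf for split k is reached with probability weight(k)/weight(root), and zero-weight leaves are never reached.

```java
import java.util.*;

public class KTree {
    static class Node {
        final long weight; final Node left, right; final int k;
        Node(int k, long weight) { this.k = k; this.weight = weight; this.left = this.right = null; }
        Node(Node l, Node r) { this.k = -1; this.weight = l.weight + r.weight; this.left = l; this.right = r; }
    }

    /** Balanced tree over leaves k = lo..hi, where w[k] plays the role of
        ||B||_k * ||C||_{i-k}; internal nodes carry the sum of their children. */
    static Node build(long[] w, int lo, int hi) {
        if (lo == hi) return new Node(lo, w[lo]);
        int mid = (lo + hi) / 2;
        return new Node(build(w, lo, mid), build(w, mid + 1, hi));
    }

    /** One root-to-leaf walk: at each node, descend left with probability
        weight(left)/weight(node). P(leaf k) = w[k] / rootWeight. */
    static int pick(Node t, Random rng) {
        while (t.left != null)
            t = (rng.nextDouble() * t.weight < t.left.weight) ? t.left : t.right;
        return t.k;
    }

    public static void main(String[] args) {
        // toy weights for splits k = 1..3 (k = 2 is impossible, weight 0)
        long[] w = {0, 2, 0, 3};
        Node root = build(w, 1, 3);
        Random rng = new Random(42);
        int[] count = new int[4];
        for (int i = 0; i < 10000; i++) count[pick(root, rng)]++;
        System.out.println(root.weight + " " + count[2]); // 5 0
    }
}
```

Over many runs the chosen splits approach the 2 : 3 ratio of the toy leaf weights, while the weight-0 split is never selected.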

2.3.2 Example: Balanced Brackets

We will use the well-known “Dyck language” of balanced brackets: S → (S)S|e, though we will use the terminals a and b instead of ( and ) to avoid notational confusion. For example, all the strings of length 6 in the language are as follows: {aaabbb, ababab, aababb, aabbab, abaabb} The grammar is first converted into CNF (process not shown) yielding:

S0 → AX | e
S → AX
X → BS | b | SY
Y → BS | b
A → a
B → b

Please note that the start symbol here is S0 (produced as the first step of Chomsky normalisation), not S.


We will now generate a random string of length 4 (the two possibilities being aabb and abab), though I will compute the potential table up to ℓ = 6 for demonstrative purposes. We follow the dynamic programming algorithm and compute table 2.1:

      1  2  3  4  5  6
S0    0  1  0  2  0  5
S     0  1  0  2  0  5
X     1  0  2  0  5  0
Y     1  0  1  0  2  0
A     1  0  0  0  0  0
B     1  0  0  0  0  0

Table 2.1: Potential table of the Dyck language for ℓ = 6

Now we need to choose a production beginning with S0. We have a single choice, S0 → AX, but we still need to choose 0 < k < 4. For each k we compute ||A||_k · ||X||_{4−k} and make a probabilistic choice with these weights (to normalise the weights we would divide by ||S0 → AX||_4, which we’d have already computed if we’d had a choice of productions). The weights for k = [1, 2, 3] are [2, 0, 0], which is not surprising, as A’s only production is terminal, meaning we need three terminals from X to fulfil our quota.

We recurse with starting symbol A and length 1, trivially choosing the substring a. We also recurse on starting symbol X and length 3. We have two productions to choose from: X → SY and X → BS. We compute the production potentials as detailed earlier, with 0 < k < 3, yielding 1 for both productions — an even choice. We randomly select X → SY. The weights for k = [1, 2] are [0, 1], meaning our choice is again forced: k = 2. We recurse on start symbol Y and length 1, and trivially choose the substring b. Recursing on starting symbol S with length 2, we can only choose S → AX with k = 1, and after recursion we trivially select the subword ab. Concatenating the results of the recursive calls yields aabb, a string of length 4 from the language.

But we should also examine Mairson’s second approach. A full trace would require many pages, so instead I shall present the example tree T_5[X → SY], in figure 2.1.


Figure 2.1: k-tree T_5[X → SY] of the Dyck language. The unlabelled values are the node weights

2.4 Ambiguous String Grammars

The previously-described technique chooses a derivation tree uniformly at random. This can also be thought of as tracing a path down the total language tree by considering the weight of the subtrees at each junction. This means it only works for unambiguous grammars, where each string has at most one derivation tree (or leaf on the total language tree). We will define d(w) as the number of ways to derive w for a given grammar; unambiguity means that d(w) = 1 if S ⇒∗ w, and 0 otherwise.

Determining ambiguity of a grammar, or producing an equivalent unambiguous grammar, is undecidable in general [9, p. 404], so the user of a program to generate random structures cannot even be warned if they input an ambiguous grammar. Instead, algorithms like those of Hickey, Cohen, and Mairson will silently lose their u.a.r. guarantee, generating string w with probability proportional to d(w) instead.

Bertoni, Goldwurm, and Santini show how to generate strings of a context-free grammar u.a.r. even when the grammar is ambiguous, in O(n² log n) time [10]. Their technique requires that the language be finitely ambiguous, meaning that the grammar has a constant upper bound on the number of derivation trees for any string, i.e. ∃m ∈ L, ∀w ∈ L, d(w) ≤ d(m). This is in contrast to languages with unbounded ambiguity, where for any arbitrarily large number, there is a string in the language with at least that number of derivation trees. Initially I attempted to adapt their algorithm instead of Mairson’s, but was not successful due to difficulties with its more complex approach.


2.5 Hypergraphs and Hyperedge-Replacement Grammars

2.5.1 Informal Overview

A hypergraph extends the concept of graphs by allowing a “hyperedge” to attach more than two nodes. We will base our model on that defined by Engelfriet [11, §3]. From here onwards, we will habitually omit the “hyper” prefix for brevity.

An edge has a label, like an ordinary edge, but instead of a “source” and “target”, it has a set of named connections, known as selectors. This is drawn as a box with undirected labelled lines (“tentacles”) from the box to each connected node. The configuration of selectors is known as that edge’s “type”; for example, an edge used in a navigational model may have the type {North, South, East, West}, and each element of that set is a selector. Other authors use an ordered sequence of attached nodes rather than a relation from selectors. I have chosen to use named selectors as I believe it is simpler to understand, rather than remembering relatively-arbitrary indices. When it comes to implementation, selector-to-node lookups will be performed with hash tables, which offer real-world performance close to array lookup by index.

A graph may have any number of nodes considered “external”, which can be considered to be the interface points of the graph to the outer system that it is contained in, and these have selectors as well, allowing us to extend the concept of “type” to graphs as well as edges. Nodes that are not external are “internal”. External nodes are shown on diagrams by a selector in brackets close to the selected node (see figure 2.2).

This allows us to replace any edge with a graph of the same type, “plugging it in” to the nodes with appropriate selectors. External nodes of the replacement graph are replaced with the appropriately-selected nodes of the replaced edge, retaining all other connections in both the outer and inner graph. The external nodes of the outer graph do not change. The graph produced by replacing edge e of H with K is notated as H[e/K]. An example is shown in figure 2.2.
The connection relation of a given edge or graph does not have to be injective. Non-injectivity is used in the string embedding of section 3.5 to encode the empty string.

A hyperedge-replacement grammar (HRG) has a starting symbol S and a set of productions A → K, where A is a nonterminal label and K is a


Figure 2.2: An example of a hyperedge replacement. This is a reproduction of figure 3.4 from [11]. The replaced edge e is the edge of H with label B.

hypergraph. A crucial difference between string and graph grammars is that in a string grammar we can find any occurrence of A and replace it with the RHS, but with HRGs we can only do so on edges labelled with A and with the same type as K. Luckily it is simple to produce an equivalent grammar where each nonterminal only produces graphs of the same type; for example, we might split a label A into A{x,y} and A{n,s,w,e}. Under such a scheme, the start symbol should be removed from all right-hand sides, which is easily done using an “inlining” procedure similar to that used when converting a string grammar to CNF. This allows it to produce graphs of different types without contradicting the concept of a single start symbol.

When drawn as a diagram, it is helpful to draw the LHS as a “handle” graph consisting of a single edge with the label of the LHS, and connected nodes corresponding to the external nodes of the RHS. This is not essential, and adds no information that could not be inferred from the RHS, but it is easier to read, as well as forcing separation of productions of the same label but different types.

2.5.2 Formal Overview

As before, this section is based on [11, §3], with some simplifications. First we shall define Γ to be the alphabet of edge labels, and Ψ to be the alphabet of selectors. These are sets of arbitrary objects with no specific restrictions.


Figure 2.3: A hyperedge-replacement grammar. This image shows three rules, as both S and B produce the first RHS. Note that S is not drawn as a handle: it is a nonterminal, not a graph, and never appears on an RHS

A hypergraph over Γ and Ψ is defined as H = (V, E, lab, nod, ext), consisting of:

• V, a finite set of nodes

• E, a finite set of (hyper)edges

• lab : E → Γ, which assigns labels to edges

• nod : E → (Ψ → V), which assigns each edge its selection function (a partial function from applicable selectors to nodes). We will treat this as a curried function, a two-argument function, or a set of triples where appropriate to simplify notation.

• ext : Ψ → V, the selection partial function of the graph, to its external nodes

Each of these elements may be given a subscript to indicate which graph they belong to, when multiple graphs are under discussion. All graphs and their nodes are considered to be completely disjoint, with implicit isomorphic cloning and symbol renaming as appropriate. We also presume that Γ and Ψ are universal, and that any graph may use any subset of each. Let the set of all graphs hence be known as HΓΨ.

We also let type be a function from all edges and graphs to subsets of Ψ, defined as the domain of ext for graphs and the domain of nod(e) for edges. To perform H′ = H[e/K], where H and K share alphabets, we first define a function remap_ext : (Ψ × V_K) → V_{H′} to remap selectors that pointed to external nodes of K to point to the attachments of e.

remap_ext(ψ, v) = nod_H(e, φ) if v = ext_K(φ) for some selector φ, and v otherwise

Then we let

• V_{H′} = (V_H ∪ V_K) \ ran ext_K

• E_{H′} = (E_H ∪ E_K) \ {e}


• lab_{H′} = (lab_H ∪ lab_K) \ {e ↦ lab_H(e)}

• nod_{K′} = {(a ↦ ψ ↦ remap_ext(ψ, v)) | (a ↦ ψ ↦ v) ∈ nod_K}

• nod_{H′} = (nod_H ∪ nod_{K′}) \ {e ↦ nod_H(e)}

• ext_{H′} = ext_H

A hyperedge-replacement grammar (HRG) over Γ and Ψ is defined as G = (N, Σ, P, S), where:

• N ⊆ Γ is a set of nonterminal edge labels

• Σ ⊆ Γ is a set of terminal edge labels (disjoint from N)

• P ⊆ N × HΓΨ is the set of productions, mapping nonterminals to graphs

• S ∈ N is the starting nonterminal

A production p = (A ↦ K) can be applied to an edge e of H if lab(e) = A and type(e) = type(K), giving the direct derivation H ⇒_p H[e/K].
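To make the definitions concrete, here is a minimal Java sketch of a hypergraph and of H[e/K]. All names and the integer-node representation are mine; for simplicity it assumes ext_K is injective (the non-injective case would additionally merge nodes of H).

```java
import java.util.*;

/** An edge carries a label and its selection function nod(e). */
class Edge {
    final String label;
    final Map<String, Integer> nod; // selector -> attached node
    Edge(String label, Map<String, Integer> nod) { this.label = label; this.nod = nod; }
}

class Hypergraph {
    final Set<Integer> nodes = new HashSet<>();
    final List<Edge> edges = new ArrayList<>();
    final Map<String, Integer> ext = new HashMap<>(); // selector -> external node

    /** Returns H[e/K]: e is removed, K's edges are copied in, K's external
        node under selector s is merged with e's attachment under s, and
        K's internal nodes get fresh ids (keeping the graphs disjoint). */
    Hypergraph replace(Edge e, Hypergraph k) {
        Hypergraph h2 = new Hypergraph();
        h2.nodes.addAll(this.nodes);
        h2.ext.putAll(this.ext);                       // ext_H' = ext_H
        Map<Integer, Integer> remap = new HashMap<>(); // remap_ext as a map
        for (Map.Entry<String, Integer> x : k.ext.entrySet())
            remap.put(x.getValue(), e.nod.get(x.getKey()));
        int fresh = this.nodes.stream().mapToInt(Integer::intValue).max().orElse(0) + 1;
        for (int v : k.nodes)
            if (!remap.containsKey(v)) {               // internal node of K
                remap.put(v, fresh);
                h2.nodes.add(fresh++);
            }
        for (Edge f : this.edges) if (f != e) h2.edges.add(f); // E_H' = (E_H ∪ E_K) \ {e}
        for (Edge f : k.edges) {
            Map<String, Integer> nod = new HashMap<>();
            f.nod.forEach((sel, v) -> nod.put(sel, remap.get(v)));
            h2.edges.add(new Edge(f.label, nod));
        }
        return h2;
    }
}
```

As a smoke test, replacing a nonterminal edge attached to two nodes with a two-edge graph of the same type yields the expected three nodes and two terminal edges.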

3 Adaptation of String Algorithms to HRGs

3.1 Prior Work

The adaptation and implementation of a string generator to HRGs has already been examined in bachelor’s projects by Coxon [4] and Lawrence [5]. Coxon produced a Scala [12] implementation of Hickey and Cohen’s algorithm, with only a GUI for input and output. His source code is available to me from the department library. Lawrence later adapted Mairson’s algorithm and implemented it in Python [13], with Graphviz/Dot [14] notation for input and output. Her source code is unfortunately not available to me, and it is not clear how complete her implementation was. From her own evaluation, her program was significantly slower than Coxon’s, despite a superior algorithm, largely due to deep copying of objects. My aim here is not to merely recreate Lawrence’s work in Java, but to re-examine the problem from a fresh perspective, and to use her project for comparison.

3.2 Substrings and Concatenation for Graphs

The string algorithm selects a production A → BC and a split k, then extracts the two parts of the RHS, generates strings b and c with |b| = k and |c| = ℓ − k, and returns their concatenation bc. But what does it mean to extract the parts of an RHS of an HRG production, and to concatenate the result?

First, for the splitting: we decide an arbitrary ordering of edges when we create a graph, which is accomplished automatically in the implementation by storing them in a list. Therefore we can deterministically choose edge 1 to be B and edge 2 to be C. For intuition on concatenation, it is helpful to stop thinking of the string procedure as “return bc” and instead think “return BC[B/b][C/c]”. This

shows how we can perform the same procedure on graphs — edges in our chosen RHS (of two hyperedges) are replaced with the recursive results, and the result is returned. Hyperedge replacement can be done in any order with the same result, as each replacement only affects one edge.

3.3 Length and Size

In order to adapt Mairson’s algorithm, we have to consider our inputs and outputs. With strings, we input a grammar, an initial nonterminal, and a desired length, where length is the number of terminals in a string, and we never consider the length of a nonterminal string. We get back a string of our desired length (or failure). Our main choice then is what “length” means for hypergraphs. I refer to this as “size”, though will maintain the use of ℓ. It is important that we choose a definition under which L_ℓ is finite for any given size, otherwise u.a.r. makes no sense. We will use size = nodes + edges (for terminal graphs only), which was also used by Lawrence [5, p. 33]. The empty graph will be the graph with no nodes and no edges.

3.4 Normal Form

The input to the string algorithm had to be an unambiguous grammar in CNF. This meant that every production was either the empty production, a terminal production which increased the length of the output by 1, or a binary production which increased the length by 0. With our definition of size, this invariant is broken. A terminal production may increase the size by any positive integer, and a binary production may increase it too. We will define the size of a production |A → BC| as terminal edges + internal nodes, as external nodes are removed when the graph is substituted (Lawrence gives an equivalent definition). Since we’re already having to accommodate the latter due to internal nodes, we may as well relax our restriction and allow a binary production to include any number of terminal edges too. Therefore we will require our productions to be of the form:

• S → x, where x is a terminal graph with |S → x| = 0 (i.e. it has no edges or internal nodes)

• A → x, where x is a terminal graph with |A → x| > 0


• A → BC, where BC is a graph with exactly 2 nonterminal edges B and C, and any number of terminal edges or nodes. Neither B nor C may be S

We also have the additional restriction that if A → BC and A → DE then type(BC) = type(DE), allowing us to replace any A-labelled edge with any RHS. S is an exception to this rule, as it does not appear on any RHS. As mentioned in section 2.5.1, the homogeneous RHS requirement can be satisfied by splitting nonterminals into one for each type of RHS it can produce (and replacing the nonterminal labels on edges with the appropriate nonterminal for that edge’s type). Dumitrescu [15] gives a procedure for converting any context-free HRG into a form of CNF for HRGs, which is a subset of the form specified here.

3.4.1 The Modified Mairson Algorithm

We now present a modified version of the Mairson algorithm to handle the normal form detailed above. We first modify the preprocessing stage, as shown in algorithm 2 (a similar algorithm appears in [5, p. 34]). Specifically, we account for terminal productions of any size in the first loop, then we account for the production size in the second loop. Anywhere else that uses ||A → BC||_ℓ must be changed similarly.

Initialise all cells for all nonterminals to 0
foreach Terminal production A → x ∈ P do
    A[|x|] ← A[|x|] + 1
for i ← 2 to ℓ do
    foreach Binary production A → BC ∈ P do
        r ← i − |A → BC|
        A[i] ← A[i] + ∑_{0 < k < r} B[k] · C[r − k]

Algorithm 2: Modified algorithm for calculation of potentials

Now we choose our production from I, but we have to also accommodate terminal productions I → x where |x| = ℓ. Clearly these all have a production potential of 1. If we choose a terminal production we are done. If we choose a binary production with |I → BC| = r, we proceed as before, but select 0 < k < ℓ − r, generate graphs b and c with |b| = k and |c| = ℓ − r − k, and return BC[B/b][C/c].

There is an exception, however. We defined production size as excluding external nodes, as these will be merged with an equal number of nodes

in the host graph; however, the first production applied is a special case. Since it is applied to the start symbol, not a graph, its external nodes become the external nodes of the final graph, therefore we do need to consider their effect on the size of the final graph. Therefore we need to add an extra boolean flag to GENERATE to tell it whether it is choosing an initial production. If it is, then the size of an initial production |S → X| is its number of terminal edges plus all of its nodes.
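The modified preprocessing can be sketched in Java as follows (the representation and all names are mine): a terminal production carries its size |A → x|, a binary production carries |A → BC|, and the inner loop shifts by r = i − |A → BC|. With every terminal production given size 1 and every binary production size 0, it degenerates to algorithm 1.

```java
import java.util.*;

public class GraphPotentials {
    static class Term { final String a; final int size;
        Term(String a, int size) { this.a = a; this.size = size; } }
    static class Bin { final String a, b, c; final int size;
        Bin(String a, String b, String c, int size) {
            this.a = a; this.b = b; this.c = c; this.size = size; } }

    /** Algorithm 2: potential table over graph sizes rather than lengths. */
    static Map<String, long[]> potentials(List<Term> terms, List<Bin> bins, int ell) {
        Map<String, long[]> t = new HashMap<>();
        for (Term p : terms) t.computeIfAbsent(p.a, x -> new long[ell + 1]);
        for (Bin p : bins)
            for (String s : new String[]{p.a, p.b, p.c})
                t.computeIfAbsent(s, x -> new long[ell + 1]);
        // a terminal production A -> x contributes one graph of size |x|
        for (Term p : terms) if (p.size <= ell) t.get(p.a)[p.size]++;
        for (int i = 2; i <= ell; i++)
            for (Bin p : bins) {
                int r = i - p.size;   // size left to split between B and C
                for (int k = 1; k < r; k++)
                    t.get(p.a)[i] += t.get(p.b)[k] * t.get(p.c)[r - k];
            }
        return t;
    }

    public static void main(String[] args) {
        // degenerate sizes (terminal 1, binary 0) on the Dyck grammar of
        // section 2.3.2 reproduce algorithm 1's table
        List<Term> term = Arrays.asList(new Term("X", 1), new Term("Y", 1),
                new Term("A", 1), new Term("B", 1));
        List<Bin> bin = Arrays.asList(new Bin("S0", "A", "X", 0), new Bin("S", "A", "X", 0),
                new Bin("X", "B", "S", 0), new Bin("X", "S", "Y", 0), new Bin("Y", "B", "S", 0));
        System.out.println(Arrays.toString(potentials(term, bin, 6).get("S")));
        // [0, 0, 1, 0, 2, 0, 5]
    }
}
```

Setting every production size to 1 instead mimics the string embedding of section 3.5 (one terminal edge per terminal production, one internal node per binary production), shifting the counts for string length n to size 2n − 1.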

3.5 Embedding of String Grammars in HRGs

A useful exercise to test my adapted algorithm is to execute it on a system which encodes strings as hypergraphs. Engelfriet describes an embedding for string CFGs where a string of length n has nodes v_0 to v_n; between each v_i and v_{i+1} is an edge with a label corresponding to the string character and connections {sou ↦ v_i, tar ↦ v_{i+1}}, and the graph has external nodes {sou ↦ v_0, tar ↦ v_n}. The empty string is encoded as a graph with a single node which is both sou and tar, which has the effect of merging v_i and v_{i+1} when substituted [11, §3.1]. A source grammar that is already in CNF will also be in HRG-CNF (as defined by Dumitrescu [15]) when embedded. These embeddings can be used as a first-order test of a random graph generator, comparing its output to the known string generators. A string of length ℓ will be represented with ℓ edges and ℓ + 1 nodes, giving a graph size of 2ℓ + 1.
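The embedding is mechanical enough to sketch directly; the following illustrative Java (types and names mine) encodes a string as nested maps and lists, and demonstrates the 2ℓ + 1 size count:

```java
import java.util.*;

public class StringEmbedding {

    /** Encodes w as the hypergraph described above: nodes v_0..v_n, one edge
        per character with connections {sou -> v_i, tar -> v_{i+1}}, and
        external nodes {sou -> v_0, tar -> v_n}. */
    public static Map<String, Object> encode(String w) {
        int n = w.length();
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i <= n; i++) nodes.add(i); // v_0 .. v_n
        List<Map<String, Object>> edges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Map<String, Integer> conn = new LinkedHashMap<>();
            conn.put("sou", i);
            conn.put("tar", i + 1);
            Map<String, Object> edge = new LinkedHashMap<>();
            edge.put("label", String.valueOf(w.charAt(i)));
            edge.put("connections", conn);
            edges.add(edge);
        }
        Map<String, Integer> ext = new LinkedHashMap<>();
        ext.put("sou", 0);
        ext.put("tar", n);
        Map<String, Object> graph = new LinkedHashMap<>();
        graph.put("nodes", nodes);
        graph.put("externals", ext);
        graph.put("edges", edges);
        return graph;
    }

    public static void main(String[] args) {
        Map<String, Object> g = encode("aab");
        int size = ((List<?>) g.get("nodes")).size() + ((List<?>) g.get("edges")).size();
        System.out.println(size); // 2 * 3 + 1 = 7
    }
}
```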

4 Implementation

4.1 Language Choice

One of my aims in this project is to produce a fast generator. Lawrence’s project was implemented in Python [13], which yielded slow performance. Coxon’s project was written in Scala [12], which runs on the Java Virtual Machine (JVM) [2], but used the older Hickey-Cohen algorithm, which Mairson improved upon.

Performance requirements largely rule out purely-functional languages such as Haskell [16], where mutation is made very difficult. Lawrence reported that a lot of time in her implementation was spent copying objects, so I want to be able to mutate data structures in-place when safe to do so. Of course, the best performance would be achieved by a language such as C++ [17] or Rust [18], which would put me in control of memory allocation; however, this is notoriously error-prone, and graph structures, being potentially cyclic, cannot be managed with reference counting, therefore I consider tracing garbage collection a necessary evil to ensure correctness.

Another personal preference is for static typing. I find this gives me a large amount of correctness testing built into the compilation stage, and allows powerful refactoring with proven safety. As a bonus, it usually results in significant performance gains due to an increased ability to resolve subroutine calls ahead-of-time. Finally, I would like to be able to easily run the program on Linux, my preferred operating system, and for it to be relatively easy for someone else to execute or continue development on it.

With these requirements, a language which runs on the JVM seems like a clear choice. It offers a garbage-collected environment with good performance, especially due to its use of just-in-time compilation. Coxon agreed, with his choice of Scala; however, I am not very familiar with Scala and would prefer to use a well-supported language like Java [2].
Coxon highlights Scala’s support for common higher-order functions as an advantage over Java; however, his report was written in 2013, before the release of Java 8, which has improved this situation. The JVM still has no support for features

of functional languages, such as higher-kinded types, but I will not need them. Ideally I would like to use Java 10 at minimum, as its introduction of var for local variables makes code easier to read; however, I have chosen Java 8 for compatibility reasons, as it is easy to install a Java 8 runtime environment or development kit on any up-to-date OS, such as the Ubuntu LTS releases used at the university.

Beyond the standard library, the following tools and libraries are used:

• Gradle [19] for build automation

• JUnit [20] for unit-testing

• JSON in Java [21] for reading user input

• Shadow for Gradle [22] to produce a “fat jar”, which packages the required libraries with my code into a single executable Java archive (jar)

4.2 User Interface

The program is used as a noninteractive command-line program. It is called with the name of the grammar file, the size of graph, the number of graphs to generate, and the deduplication mode (see below). Output is written to stdout, which can easily be redirected to a file on a Unix-like shell.

The IO format uses JavaScript Object Notation (JSON) [3]; an example is seen in figure 4.1. The program output is a JSON list of graphs. I chose JSON because it is a very widely-supported format, is simple enough to read and write by hand, and is fully self-describing¹. This means it would be easy to write code to import the graphs produced by my program into another environment.

A production is a JSON object with two entries: lhs — the nonterminal label (a string) — and rhs — the replacement graph (a graph in the same format as figure 4.1). A grammar is a list of productions. Instead of explicitly defining alphabets and a start symbol, the convention is used that a label beginning with a capital letter is nonterminal, a label beginning with a lowercase letter is terminal, and anything else is rejected by the program. S is treated specially as the start symbol of a grammar.

1Meaning the values of the encoded structure are explicitly demarcated by key, rather than appearing at a specific byte offset, which would need to be known a priori
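The label convention above can be sketched as a small classifier. This is an illustration only: the class and method names are my own assumptions, not the program's actual code.

```java
// A sketch of the label convention described above: labels beginning with a
// capital letter are nonterminals, lowercase labels are terminals, and
// anything else is rejected. Names here are illustrative assumptions.
public class LabelConvention {
    enum Kind { TERMINAL, NONTERMINAL }

    static Kind classify(String label) {
        if (label.isEmpty()) {
            throw new IllegalArgumentException("empty label");
        }
        char first = label.charAt(0);
        if (Character.isUpperCase(first)) {
            return Kind.NONTERMINAL; // e.g. "S", the start symbol
        }
        if (Character.isLowerCase(first)) {
            return Kind.TERMINAL; // e.g. "a", "b"
        }
        throw new IllegalArgumentException("label must start with a letter: " + label);
    }

    public static void main(String[] args) {
        System.out.println(classify("S")); // NONTERMINAL
        System.out.println(classify("b")); // TERMINAL
    }
}
```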


{
    "nodes": [1, 2, 3],
    "externals": {"i": 1, "o": 2},
    "edges": [
        {
            "label": "B",
            "connections": {"i": 1, "f": 2, "t": 3}
        },
        {
            "label": "S",
            "connections": {"i": 3, "o": 1}
        }
    ]
}

Figure 4.1: Graph H of Figure 2.2 in the program’s IO format

4.2.1 Deduplication

“Generate n graphs from this grammar” is an ambiguous request. It could be interpreted as:

• “Generate a single graph” n times, returning exactly n graphs, but possibly with duplicates
• Generate n graphs and remove duplicates, returning between 1 and n results
• Keep generating graphs until n unique graphs have been found or a timeout has been reached

I have implemented the first two. The program will generate the number of graphs that the user requests, and can be instructed to ignore or remove some duplicates, which I will refer to as “deduplication”.

Graph isomorphism is in NP; however, there are a number of “early-exit” tricks to speed up returning a “not equal” result. First, we store our generated graphs in a HashSet, which uses the hashCode method of its keys to determine the bucket in which to place the object, and then performs a test with equals to determine whether any of the existing entries are the same. This relies on the invariant that x.equals(y) ⇒ x.hashCode() == y.hashCode(), but not vice versa. Therefore the hashCode function should attempt to map graphs to a reasonably unique integer, but should prioritise speed over uniqueness. The hashCode of Graph objects uses the common method of summing key numbers that have been multiplied by a prime, to increase uniformity of the distribution. It is determined by:

• The number of nodes
• The number of terminal edges
• The number of nonterminal edges


• The type of the graph (hashCode of the set of external selectors)
• The number of external nodes (recall that selection need not be injective)
• The number of edges for each label
• The number of nodes for each configuration of incoming selectors

As a further optimisation, the graphs cache their hashCode instead of recomputing it every time.

Due to time constraints I was not able to implement a functioning hypergraph isomorphism test; therefore two graphs which are equal on all of the properties listed above are considered equal. Even if I had finished the implementation, a full isomorphism test would likely have increased the execution time to such an extent as to make the program unusable, so I would have included an option to use approximate isomorphism anyway.
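The prime-multiplied, cached hashCode described above can be sketched as follows. The field names are illustrative assumptions (only four of the listed properties are shown), not the project's actual Graph class, and equals is omitted for brevity.

```java
import java.util.Set;

// A sketch of the hashCode strategy described above: fold structural counts
// together with a prime multiplier, and cache the result so it is computed
// at most once per graph. Field names are assumptions for illustration.
public class GraphHash {
    private final int nodeCount;
    private final int terminalEdges;
    private final int nonterminalEdges;
    private final Set<String> externalSelectors; // the graph's "type"
    private Integer cachedHash; // computed lazily, then reused

    GraphHash(int nodes, int terminal, int nonterminal, Set<String> selectors) {
        this.nodeCount = nodes;
        this.terminalEdges = terminal;
        this.nonterminalEdges = nonterminal;
        this.externalSelectors = selectors;
    }

    @Override
    public int hashCode() {
        if (cachedHash == null) { // cache hit avoids recomputation on every lookup
            int h = 17;
            h = 31 * h + nodeCount;
            h = 31 * h + terminalEdges;
            h = 31 * h + nonterminalEdges;
            h = 31 * h + externalSelectors.hashCode();
            cachedHash = h;
        }
        return cachedHash;
    }
}
```

Structurally equal graphs hash equally, while a change to any counted property changes the hash, which is exactly the speed-over-uniqueness tradeoff the text describes.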

5 Evaluation

5.1 Grammars

The following grammars are used in the next sections.

5.1.1 The Palindrome Grammar

In order to provide a rough comparison with Lawrence and Coxon, I use a language that they used: palindromes over a binary alphabet. However, I have constructed the grammar myself, as theirs did not fit my normal form.

S0 → AX | BY | a | b | e
S → AX | BY | a | b
X → SA | a
Y → SB | b
A → a
B → b

Where S0 is the start symbol. This is embedded as an HRG as detailed in section 3.5.

5.1.2 The a∗|bbb Grammars

These are two embedded string grammars for the language defined by the regular expression a∗|bbb: one ambiguous and one unambiguous, both in CNF.

Ambiguous:
S → e | AA | a | bbb
A → AA | a

Unambiguous:
S → e | AX | a | bbb
X → AX | a
A → a

Now consider ℓ = 3, giving the sublanguage {aaa, bbb}, and the possible leftmost derivations. In both cases we have the derivation S ⇒ bbb. As for aaa, in the unambiguous grammar we have S ⇒ AX ⇒ aX ⇒ aAX ⇒ aaX ⇒ aaa, but in the ambiguous grammar we have both S ⇒ AA ⇒ aA ⇒* aaa and S ⇒ AA ⇒ Aa ⇒* aaa.
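The ambiguity can be made concrete by counting derivation trees. The following sketch (an illustration, not the project's code) counts the derivations of a^n from the ambiguous rule A → AA | a; the count is the Catalan number C(n−1), so any n ≥ 2 yields more than one tree.

```java
// Count the derivation trees of a^n under the ambiguous rule A -> AA | a.
// This is an illustrative sketch; the recurrence mirrors the split choice
// in the grammar, and the result is the Catalan number C(n-1).
public class DerivationCount {
    static long countA(int n) {
        if (n == 1) {
            return 1; // A -> a
        }
        long total = 0;
        for (int k = 1; k < n; k++) {
            // A -> AA, with the first A deriving a^k and the second a^(n-k)
            total += countA(k) * countA(n - k);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countA(3)); // 2, matching the two derivations of aaa
        System.out.println(countA(4)); // 5
    }
}
```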


5.1.3 The “AB” Grammar

It is not sufficient to test only embedded string grammars. I had originally planned to use the tree and flow chart grammars as Lawrence and Coxon had done, but was unable to understand their definitions well enough to reproduce them in my required normal form. Instead I constructed a bespoke grammar, shown in figure 5.1. Unfortunately this grammar is ambiguous for many sizes; however, it can still be used to demonstrate performance and the effect of ambiguity.

Figure 5.1: The “AB” grammar. This image shows three rules, as both S and B produce the first RHS. Not shown are two trivial rules to replace A and B with the terminal graphs a and b

5.2 Correctness

Testing correctness is a fairly simple affair. We choose a grammar and size where we know the exact size of the size-restricted language, then generate a large number of graphs without deduplication. We then check that each unique graph generated occurs with a frequency of 1/|Lℓ|.

This counting requires sorting the produced graphs into buckets by isomorphism. A problem arises when distinct graphs are considered approximately isomorphic by my program. This occurs with two anagram strings embedded as graphs. Therefore, for these grammars, I extracted the string back out before comparing it to previously-produced strings.

I tested the palindrome grammar. The results are seen in figure 5.3, showing a maximum error of 0.4% (for aabbaa).
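The frequency check described above can be sketched as follows. The buckets here are strings standing in for isomorphism classes of graphs, and the helper names are my own assumptions, not the project's code.

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of the uniformity check described above: bucket the generated
// objects, then compare each bucket's relative frequency against the ideal
// 1/|L_l|. Strings stand in for graphs bucketed by isomorphism.
public class UniformityCheck {
    // largest relative deviation of any bucket from the uniform frequency
    static double maxError(Map<String, Integer> counts, int total, int languageSize) {
        double expected = 1.0 / languageSize;
        double worst = 0.0;
        for (int count : counts.values()) {
            double frequency = (double) count / total;
            worst = Math.max(worst, Math.abs(frequency - expected) / expected);
        }
        return worst;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("aaa", 498); // slightly uneven sample, as a real run would be
        counts.put("bbb", 502);
        System.out.printf("max relative error: %.3f%n", maxError(counts, 1000, 2));
    }
}
```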

5.2.1 Ambiguity

A particular output graph will be produced with probability proportional to its number of derivation trees. We demonstrate this with the two a∗|bbb

grammars for ℓ = 3. String aaa is produced twice as frequently as bbb due to its two parse trees, as seen in figure 5.4. We also test the AB grammar for ℓ = 24, where it is ambiguous. The results are shown in figure 5.5.

5.3 Performance

5.3.1 Methodology

The JVM performs just-in-time compilation of “hot spots”, and also has a high start-up time, therefore testing should be done with repeated calls to the generator within a single JVM execution. I used code similar to that shown in figure 5.2.

for (int i = 0; i < 2; i++) {
    MairsonGenerator generator = new MairsonGenerator(grammar);
    for (int size = minSize; size <= maxSize; size++) {
        long startTime = System.currentTimeMillis();
        generator.generateGraphs(size, iterations);
        long timeDelta = System.currentTimeMillis() - startTime;
        if (i == 1) {
            // Print in CSV format
            System.out.printf("%d,%d%n", size, timeDelta);
        }
    }
}

Figure 5.2: A condensed version of the code used for testing

The reason for the outer loop of exactly two iterations is the result of experimentation: I found that the first iteration always yields slower results, yet beyond this every iteration yields very consistent results. I therefore decided to use the first iteration as a “warm-up” and the second to produce the actual results.

The program uses the logic “generate one graph; if it is not null, generate n − 1 more, otherwise return null immediately”, meaning that it is able to abort very quickly when given an impossible task. These data points are omitted from the results (and hence interpolated on the line graphs).

Due to time constraints I was not able to implement the k-tree method,

meaning these results use Mairson's slower algorithm.

For the AB grammar, we are able to test deduplication. In the duplicates-permitted case, the program allocates an array of size n to store the results. While this allocation is included in the timed region, it is a constant factor that will be eclipsed by the actual generation time. For the tests with deduplication, a HashSet is used to store the results. As HashSet is a dynamically-expanding data structure, I performed two tests: one where the set object was constructed with a default size, meaning it would have to reallocate its memory as it expanded, and one where the object was constructed with an initial size of n. I found no significant difference in the runtime, so I removed the option to preallocate the set. However, this result may vary on other operating systems and processors (the testing environment is detailed in the next section).

5.3.2 Results

These results were computed on an Intel i5-6500 (3.2GHz quad-core, no SMT, Skylake microarchitecture), with approximately 10GB of available physical memory. The software environment was Oracle JRE 1.8.0_212, running atop Linux 5.0.3 for x86_64. The generator uses only a single internal thread, and the test environment likewise only called the generator serially.

First I tested the AB grammar, with and without deduplication, up to ℓ = 25, as seen in figure 5.6.

Next I tested the language of palindromes, for a rough comparison with Lawrence and Coxon. Figure 5.7 shows a runtime slower than Coxon's but faster than Lawrence's. I had intended to test these right up to graph size 500, in line with both authors; however, at size 251 the production weights began to overflow the long used to store them.

I was unable to reproduce a Scala environment that could execute Coxon's program, so I cannot compare our results directly, as they were obtained on different hardware. It is safe to presume that Coxon's test machine was not significantly faster than mine; in fact, his computer and runtime from 2013 were likely slower than mine from 2019. This implies that my implementation is slower than his.
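The overflow at size 251 could be avoided by storing production weights as BigInteger rather than long. The sketch below contrasts the two; the repeated doubling is a stand-in growth rate, not the palindrome grammar's actual weight function.

```java
import java.math.BigInteger;

// Illustrates why production weights overflow a long and how BigInteger
// avoids it. The doubling growth rate is an assumption for illustration.
public class WeightOverflow {
    // true if doubling 1 the given number of times overflows a long
    static boolean longOverflows(int doublings) {
        long w = 1L;
        try {
            for (int i = 0; i < doublings; i++) {
                w = Math.multiplyExact(w, 2); // throws ArithmeticException on overflow
            }
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // the same computation with arbitrary precision, exact at any size
    static BigInteger bigDoubled(int doublings) {
        return BigInteger.ONE.shiftLeft(doublings);
    }

    public static void main(String[] args) {
        System.out.println(longOverflows(62)); // false: 2^62 still fits in a long
        System.out.println(longOverflows(70)); // true: a long holds at most 2^63 - 1
        System.out.println(bigDoubled(70));    // exact value of 2^70
    }
}
```

The tradeoff is speed: BigInteger arithmetic is slower than primitive long arithmetic, so the exact crossover point would need benchmarking.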


Figure 5.3: Distribution of 800,000 graphs from the palindrome grammar, with graph size 13 (string length 6)

Figure 5.4: Distribution of 100,000 graphs from the two a∗|bbb grammars, with graph size 7 (string length 3)


Figure 5.5: Distribution of 100,000 graphs from the AB grammar, with graph size 24

Figure 5.6: Time to generate 100,000 graphs of each size 0 to 25 from the “AB” grammar. With (dashed line) and without (solid line) deduplication


Figure 5.7: Time to generate 100 graphs of each size 0 to 250 without deduplication from the embedded palindrome graph

6 Conclusion

This project was neither a complete success nor a failure. A working program was produced; however, it does not appear to be a significant improvement over Coxon's work. Despite this, the adaptation of Mairson's algorithm proved to be an interesting problem without simply being a replication of Lawrence's work.

This project originally started with the intent of adapting Bertoni et al.'s algorithm for finitely ambiguous grammars [10], but challenges with the mathematics of their approach led me to switch to Mairson's algorithm fairly late in the timeframe of the project. It is possible that I could have switched sooner had I organised my time so as to start adapting the algorithm earlier. This would likely have given me more time to use the initial results of my evaluation to improve the program further, and perhaps to become the fastest of the three student programs. A major change which may have achieved this would have been implementing k-trees (from section 2.3.1).

Despite this, the program is a relatively clean implementation and is able to output results in JSON format, which I believe is a significant advantage. JSON can be imported into almost any programming environment with common libraries, while Coxon's report indicates that the only output from his program is a graphical rendering. Lawrence's program outputs files in Graphviz format, but has the downside of being considerably slower than mine or Coxon's implementation.

6.1 Opportunities for Further Work

It should be possible to modify Mairson's algorithm further to support any unambiguous CFG in which e is derived only from S or not at all, and S never appears on an RHS, without the need to convert it into CNF first. The final change needed is the handling of any number of nonterminals in an RHS, not just 0 or 2. This would entail choosing splits k_1, ..., k_{n−1} where Σ k_i = ℓ − r, for a production with n nonterminals and a production size of r.

In terms of implementation, it would be interesting to modify the code to be generic enough to produce any type of structure generated by a formal grammar. These structures would need to have operations defined

on them for properties such as size, operations such as replacement, and axioms such as commutativity of replacement. I have not done so due to time constraints, code clarity, and performance concerns.

As mentioned previously, the implementation could be improved with the use of k-trees. A further optimisation would be to use multithreading inside the generator. The majority of the algorithm is an “embarrassingly parallel”1 divide-and-conquer procedure. The main reason I did not do so was a lack of familiarity with threading on the implementation platform, to the extent that I could not be confident of having no hidden race conditions. Despite this, it would be easy to execute multiple generator objects in parallel at the application level.

Finally, there is an opportunity for a further project to adapt Bertoni et al.'s algorithm [10], potentially permitting ambiguous grammars to be used to generate graphs u.a.r.
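The split choice proposed above amounts to enumerating compositions of ℓ − r into n positive parts. The sketch below illustrates this; the names and the assumption that each nonterminal derives a structure of size at least 1 are mine, not the project's.

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of the generalised split choice described above: all ways to
// assign sizes k_1..k_n (each >= 1) to the n nonterminals of a production
// so that they sum to l - r. This is a standard composition enumeration,
// offered as an illustration of the proposed generalisation.
public class Splits {
    static List<int[]> compositions(int total, int parts) {
        List<int[]> out = new ArrayList<>();
        build(total, parts, new int[parts], 0, out);
        return out;
    }

    private static void build(int remaining, int parts, int[] acc, int idx, List<int[]> out) {
        if (idx == parts - 1) {
            if (remaining >= 1) { // last part takes whatever remains
                acc[idx] = remaining;
                out.add(acc.clone());
            }
            return;
        }
        // leave at least 1 for each of the remaining parts
        for (int k = 1; k <= remaining - (parts - 1 - idx); k++) {
            acc[idx] = k;
            build(remaining - k, parts, acc, idx + 1, out);
        }
    }

    public static void main(String[] args) {
        // sizes for 3 nonterminals summing to 5: C(4,2) = 6 compositions
        System.out.println(compositions(5, 3).size()); // 6
    }
}
```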

1A term first used by Cleve Moler [23] to refer to algorithms that naturally split into multiple independent tasks
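The application-level parallelism suggested above could look like the following: several independent generator objects, one per thread, with their results merged at the end. "Generator" here is a stand-in interface, not the project's actual class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of application-level parallelism: each generator is used by exactly
// one task, so no state is shared between threads. The Generator interface
// is an assumption standing in for the project's generator class.
public class ParallelGeneration {
    interface Generator {
        List<String> generate(int size, int count);
    }

    static List<String> generateParallel(List<Generator> generators, int size, int perGenerator)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(generators.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Generator g : generators) {
                futures.add(pool.submit(() -> g.generate(size, perGenerator)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // wait for and collect each task's output
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that this leaves each generator single-threaded internally, sidestepping the race-condition concern while still using all cores when many graphs are requested.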

Bibliography

[1] H. G. Mairson, ‘Generating words in a context-free language uniformly at random’, Information Processing Letters, vol. 49, no. 2, pp. 95–99, 1994, ISSN: 0020-0190. DOI: 10.1016/0020-0190(94)90033-7. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0020019094900337.

[2] Oracle Corporation, ‘Java SE specifications’, Tech. Rep. [Online]. Available: https://docs.oracle.com/javase/specs/.

[3] Ecma International, ‘The JSON Data Interchange Syntax’, 2nd ed., Tech. Rep., 2017. [Online]. Available: https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf.

[4] J. Coxon, ‘Uniform random generation of graphs with graph grammars’, BSc Dissertation, University of York, 2013.

[5] C. Lawrence, ‘Uniform generation of graph languages’, BSc Dissertation, University of York, 2015.

[6] C. Bak, ‘GP 2: Efficient implementation of a graph programming language’, PhD thesis, University of York, 2015.

[7] D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing. New York, NY: ACM Press; Reading, MA: Addison-Wesley, 1993, ISBN: 0201542757.

[8] T. Hickey and J. Cohen, ‘Uniform random generation of strings in a context-free language’, SIAM Journal on Computing, vol. 12, no. 4, pp. 645–655, 1983. DOI: 10.1137/0212044. [Online]. Available: https://doi.org/10.1137/0212044.

[9] J. Hopcroft, R. Motwani and J. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd ed. Boston; London: Addison-Wesley, 2001, ISBN: 0201441241.

[10] A. Bertoni, M. Goldwurm and M. Santini, ‘Random generation for finitely ambiguous context-free languages’, RAIRO - Theoretical Informatics and Applications, vol. 35, no. 6, pp. 499–512, 2001. DOI: 10.1051/ita:2001128. [Online]. Available: http://www.numdam.org/item/ITA_2001__35_6_499_0.


[11] J. Engelfriet, ‘Context-free graph grammars’, in Handbook of Formal Languages, vol. 3, G. Rozenberg and A. Salomaa, Eds. New York, NY: Springer-Verlag, 1997, pp. 125–213, ISBN: 3-540-60649-1. [Online]. Available: http://dl.acm.org/citation.cfm?id=267871.267874.

[12] École Polytechnique Fédérale de Lausanne, Scala homepage. [Online]. Available: https://www.scala-lang.org/.

[13] Python Software Foundation, Python homepage. [Online]. Available: https://www.python.org/.

[14] J. Ellson et al., ‘Graphviz and Dynagraph – static and dynamic graph drawing tools’, Tech. Rep. [Online]. Available: https://graphviz.gitlab.io/_pages/Documentation/EGKNW03.pdf.

[15] S. Dumitrescu, ‘Several aspects of context freeness for hyperedge replacement grammars’, WSEAS Transactions on Computers, vol. 7, Jan. 2008.

[16] S. Marlow et al., ‘Haskell 2010 language report’, Tech. Rep., 2010. [Online]. Available: https://www.haskell.org/onlinereport/haskell2010/.

[17] International Organization for Standardization, ‘C++17 standard’, Tech. Rep., 2017. [Online]. Available: https://isocpp.org/std/the-standard.

[18] The Rust Team, Rust homepage. [Online]. Available: https://www.rust-lang.org/.

[19] Gradle Inc., Gradle homepage. [Online]. Available: https://gradle.org/.

[20] The JUnit Team, JUnit homepage. [Online]. Available: https://junit.org/.

[21] S. Leary, JSON in Java source repository. [Online]. Available: https://github.com/stleary/JSON-java.

[22] J. Rengelman, Shadow homepage. [Online]. Available: https://imperceptiblethoughts.com/shadow/.

[23] C. Moler, Matrix Computation on Distributed Memory Multiprocessors. Society for Industrial and Applied Mathematics, 1986.
