Generation of Uniformly-Random Graphs
Department of Computer Science

Submitted in part fulfilment for the degree of BSc.

Mark Henrick
29th of April 2019
Supervisor: Detlef Plump

Contents

Executive Summary
0.1 Ethics
1 Introduction
2 Literature Review
2.1 Preliminaries and Notation
2.2 Generation of Strings u.a.r.
2.3 Mairson’s Methods
2.3.1 A Space-Time Tradeoff
2.3.2 Example: Balanced Brackets
2.4 Ambiguous String Grammars
2.5 Hypergraphs and Hyperedge-Replacement Grammars
2.5.1 Informal Overview
2.5.2 Formal Overview
3 Adaptation of String Algorithms to HRGs
3.1 Prior Work
3.2 Substrings and Concatenation for Graphs
3.3 Length and Size
3.4 Normal Form
3.4.1 The Modified Mairson Algorithm
3.5 Embedding of String Grammars in HRGs
4 Implementation
4.1 Language Choice
4.2 User Interface
4.2.1 Deduplication
5 Evaluation
5.1 Grammars
5.1.1 The Palindrome Grammar
5.1.2 The a*|bbb Grammars
5.1.3 The “AB” Grammar
5.2 Correctness
5.2.1 Ambiguity
5.3 Performance
5.3.1 Methodology
5.3.2 Results
6 Conclusion
6.1 Opportunities for Further Work

Executive Summary

This project aims to produce a program which accepts as input an unambiguous context-free hyperedge replacement grammar and produces random hypergraphs of a given size¹. The important property is that these hypergraphs should be generated uniformly at random (u.a.r.), meaning that every hypergraph of the specified size that the grammar produces should be generated with equal probability. A primary use case of this software is to produce random inputs for testing graph algorithms; therefore the generator should write its results in a computer-readable format suitable for use in other programs.
The only similar programs found in the literature review are past BSc projects, which use slightly different algorithms or software platforms from this project. In chapter 1 further background on the state of the art is given. Chapter 2 explores the existing string algorithms and explains hypergraphs and hyperedge replacement grammars (HRGs). Chapter 3 details the adaptation of a string algorithm by Harry Mairson [1] to HRGs. While the original algorithm required the input to be in Chomsky normal form, these restrictions are partially relaxed in the process of adapting the algorithm to account for the properties of hypergraphs. In chapter 4 the implementation is covered, including software platform choice and a discussion of how to remove many duplicate graphs from the output in an efficient manner. The algorithm is implemented in Java [2], and results are rendered in JSON [3], a widely-supported data interchange format. The evaluation of the program is detailed in chapter 5. The program produces graphs with the expected distribution, but unfortunately is found to be generally slower than a program produced by Jake Coxon [4], while performing faster than one produced by Carla Lawrence [5]. The report is concluded in chapter 6, which details some areas for further work.

¹ The size of a hypergraph is the sum of the numbers of hyperedges and nodes.

0.1 Ethics

As this project is purely adapting and implementing a rather abstract mathematical algorithm, there are no direct ethical implications. As usual, academic integrity must be maintained, and is of heightened importance due to the existence of similar student projects in the area.

1 Introduction

Graphs are one of the most ubiquitous and versatile data structures in computer science and discrete mathematics.
As they can be used to model a large number of problems, there is considerable interest in the manipulation of graphs, resulting in programming languages designed specifically for that purpose, such as GP2 [6]. A concern for all software development is testing, which can take the form of formal verification, hand-written assertions, or generating random test cases that are checked for certain invariants. The aim of this project is to develop a method which can be used to generate graphs derived from a specific grammar uniformly at random (u.a.r.). These outputs can then be used as random inputs to a graph algorithm, allowing semi-automatic testing.

Random graph generators do exist, such as Stanford GraphBase [7]; however, these are considerably less powerful than what is needed for this project. The existing generators primarily generate “ordinary” graphs, whereas we will be generating hypergraphs, and they rarely give much control over the “shape” of the graph. We will be generating graphs from hyperedge-replacement grammars, which allow powerful specification of graph languages.

Prior work in the area of uniform generation from grammars has primarily focused on strings, for which it is mostly a solved problem, as detailed in section 2.2; however, work extending this to graphs is limited, as detailed in section 3.1. This project aims to add to this prior work with a new hypergraph generator for the Java platform, using a variant of an existing string algorithm by Mairson (detailed in section 2.3), which is faster than a similar program written by Lawrence (see again section 3.1).

2 Literature Review

2.1 Preliminaries and Notation

This report presumes rudimentary knowledge of context-free string grammars. We will define a context-free grammar (CFG) for strings as $G = (N, \Sigma, P, S)$ where $N$ is a set of nonterminals (variables), $\Sigma$ is the terminal alphabet (disjoint from $N$), $P \subseteq N \times (N \cup \Sigma)^*$ is the set of productions (rules) and $S \in N$ is the start symbol.
$L$ is the language generated by the grammar, and we write $L_\ell$ for the sublanguage restricted to a specific string length, $L_\ell = L \cap \Sigma^\ell$. Note that while $L$ may be (countably) infinite, $L_\ell$ is finite with cardinality at most $|\Sigma|^\ell$. We will write terminals as lowercase and nonterminals as capitals. $\Rightarrow$ denotes direct derivation and $\Rightarrow^*$ means derivation by any number of steps. $\varepsilon$ denotes the empty string, and $\ell$ will be used throughout to denote the length of the string, or size of the hypergraph, which we wish to generate.

2.2 Generation of Strings u.a.r.

The problem of generating strings uniformly at random (u.a.r.) from a context-free grammar (CFG) has received substantial attention. We formally specify the problem as follows: given inputs of a CFG $G$ and a length $\ell > 0$, describe an algorithm to select a string from $L_\ell$ with probability $1/|L_\ell|$. Methods based on choosing available productions u.a.r. will not work, as strings with shorter derivations will be more likely to be generated. Intuitively, one can think of a total language tree: this approach would only work if the tree were perfectly balanced. Hickey and Cohen [8] present two algorithms for unambiguous grammars. This work is improved upon by Mairson [1], who presents two algorithms that give a tradeoff between linear time via use of a quadratic-size data structure versus quadratic time and linear space.

2.3 Mairson’s Methods

Mairson presumes an unambiguous grammar $G$ that is in Chomsky normal form (CNF). This means any production is of the form $A \to BC$ (which I will call “binary productions”), $A \to x$ (which I will call “terminal productions”), or $S \to \varepsilon$ (which I will call “the empty production”). There is a well-known terminating algorithm to convert any CFG to this form. Generation is considered with regard to a certain starting symbol, which may not be the “global” $S$ of the grammar; we will refer to it as $I$ for “initial”. If $\ell < 2$, we simply choose a random production $I \to$
$x$ where $|x| = \ell$ and return $x$ (in practice the only possibilities for $\ell = 0$ are $S \to \varepsilon$ or failure). This leaves the case of $\ell > 1$.

Mairson defines the “potential” of a symbol for a given length (terminology mine; Mairson does not give it a name), denoted $\|A\|_\ell$, as the number of strings of length $\ell$ that can be derived from $A$ in any number of steps. This can be computed as the number of strings that can be directly derived from $A$ (which will all be of length 1, so this term only contributes when $\ell = 1$), plus the potential of each binary production with a left-hand side of $A$. The production potential $\|A \to BC\|_\ell$ is the number of strings of length $\ell$ with a derivation starting with $A \to BC$. To generate such a string we have a choice of where to split the length between the part generated from $B$ and the part generated from $C$, as long as both lengths are positive and sum to $\ell$. In other words,

$$\|A \to BC\|_\ell = \sum_{0<k<\ell} \|B\|_k \cdot \|C\|_{\ell-k}.$$

Now we choose a production $I \to BC$ with probability

$$\frac{\|I \to BC\|_\ell}{\|I\|_\ell}$$

and the split $0 < k < \ell$ with probability

$$\frac{\|B\|_k \cdot \|C\|_{\ell-k}}{\|I \to BC\|_\ell},$$

then recurse on $(B, k)$ and $(C, \ell - k)$, and return the concatenated result. The algorithm fails exactly when $\|I\|_\ell = 0$. Once this initial check has passed, every selection and recursive call that the algorithm makes is guaranteed to succeed. The potentials can be calculated efficiently using dynamic programming (algorithm 1).

    Initialise all cells for all nonterminals to 0
    foreach terminal production A → x ∈ P do
        A[1] ← A[1] + 1
    for i ← 2 to ℓ do
        foreach binary production A → BC ∈ P do
            A[i] ← A[i] + Σ_{0<k<i} B[k] · C[i − k]
    ‖A‖_ℓ can now be found at A[ℓ]

Algorithm 1: Mairson’s algorithm for calculation of potentials

2.3.1 A Space-Time Tradeoff

The preprocessing algorithm that was just discussed produces a data structure of size $O(\ell)$ for a constant grammar; however, generating a word using this data structure has a time complexity quadratic in the length of the string. Mairson also offers a method to produce an auxiliary data structure with size quadratic in $\ell$ which can later be used for linear-time string generation.
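To make the basic method of section 2.3 concrete (the linear-space variant, not the quadratic-size structure), the following is a minimal Python sketch of the potential table and the recursive sampling step. Several things here are assumptions made for illustration rather than anything taken from the report: the small CNF grammar for non-empty balanced brackets, the names `TERMINAL`, `BINARY`, `potentials` and `generate`, and the decision to merge the choice of production and the choice of split into a single draw. The merged draw selects each (production, split) pair with probability $\|B\|_k \cdot \|C\|_{\ell-k} / \|I\|_\ell$, which is the product of the two probabilities described above. The empty production is not modelled.

```python
import random

# Hypothetical CNF grammar for non-empty balanced brackets, chosen for this
# sketch (an assumption, not taken from the report). Start symbol: "B".
TERMINAL = {"L": ["("], "R": [")"]}      # terminal productions A -> x
BINARY = {                               # binary productions A -> B C
    "B": [("L", "R"), ("L", "C"), ("D", "B")],
    "C": [("B", "R")],
    "D": [("L", "R"), ("L", "C")],
}


def potentials(length):
    """Build table[A][i] = number of strings of length i derivable from A."""
    symbols = set(TERMINAL) | set(BINARY)
    table = {a: [0] * (max(length, 1) + 1) for a in symbols}
    for a, words in TERMINAL.items():
        table[a][1] += len(words)        # one string per terminal production
    for i in range(2, length + 1):
        for a, productions in BINARY.items():
            for b, c in productions:
                # Sum over every split of i into two positive lengths.
                table[a][i] += sum(
                    table[b][k] * table[c][i - k] for k in range(1, i)
                )
    return table


def generate(symbol, length, table, rng):
    """Sample one string of the given length from `symbol`.

    `table` must have been built for at least `length`. The result is
    uniform provided the grammar is unambiguous.
    """
    total = table[symbol][length]
    if total == 0:
        raise ValueError(f"{symbol} derives no string of length {length}")
    if length == 1:
        return rng.choice(TERMINAL[symbol])
    # Draw an index below the potential, then walk the (production, split)
    # pairs, each of which owns a block of table[b][k] * table[c][length-k]
    # indices. Integer arithmetic keeps the weights exact.
    r = rng.randrange(total)
    for b, c in BINARY.get(symbol, []):
        for k in range(1, length):
            weight = table[b][k] * table[c][length - k]
            if r < weight:
                left = generate(b, k, table, rng)
                right = generate(c, length - k, table, rng)
                return left + right
            r -= weight
    raise AssertionError("unreachable if the table matches the grammar")


if __name__ == "__main__":
    table = potentials(6)
    print(table["B"][6])                              # size of L_6
    print(generate("B", 6, table, random.Random()))   # one random member
```

For this grammar the potentials of `B` at even lengths follow the Catalan numbers, which gives a simple sanity check on the table. Each call to `generate` scans all splits at every level of the recursion, which is where the quadratic generation time mentioned above comes from.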