<<

Developing and Certifying Datalog Optimizations in Coq/MathComp

Pierre-Léo Bégay Pierre Crégut Jean-François Monin Orange Labs Orange Labs Univ. Grenoble Alpes Lannion, France Lannion, France CNRS, Grenoble INP, VERIMAG Univ. Grenoble Alpes [email protected] Grenoble, France CNRS, Grenoble INP, VERIMAG jean-francois.monin@univ-grenoble- Grenoble, France alpes.fr [email protected]

Abstract new facts. Recursivity makes it possible to compute transitive We introduce a static analysis and two program transfor- closures, e.g. accessibility in graphs, in a simpler and more mations for Datalog to circumvent performance issues that complete way than other query languages, such as SQL, arise with the implementation of primitive predicates, no- XPath and SPARQL. In contrast to , the evaluation tably in the framework of a large scale telecommunication mechanism of Datalog follows a bottom-up strategy which application. To this effect, we introduce a new trace seman- guarantees termination [2] – in this framework, the set of tics for Datalog with a verified mechanization. This work derivable facts is always finite. can be seen as both a first step and a proof of concept for the Originally designed as a powerful on creation of a full-blown library of verified Datalog optimiza- , it has recently gained interest thanks to domain- tions, on top of an existing Coq/MathComp formalization of specific extensions [2, 29]. The introduction of [5] provides a Datalog[5, 14] towards the development of a realistic envi- comprehensive list of languages built upon Datalog [3, 9, 16, ronment for certified data-centric applications. 27] and applications, in both academic [13, 18] and industrial CCS Concepts: • Security and privacy → and ver- [11, 15] settings. ification; • Theory of computation → Program analysis; This work originates in a large-scale application of Datalog • Networks → Network dynamics. to a telecommunication verification tool aiming at computing connectivity properties in virtual infrastructure managers Keywords: Datalog, Coq, MathComp, semantics. such as OpenStack. Datalog can model traffic forwarding in ACM Reference Format: various networking elements in only a few lines of code. Pierre-Léo Bégay, Pierre Crégut, and Jean-François Monin. 2021. Developing and Certifying Datalog Optimizations in Coq/Math- Lopes et al have shown in [25, 26] that the Datalog engine Comp. In Proceedings of the 10th ACM SIGPLAN International Con- must use specific representations of set of values rather than ference on Certified Programs and Proofs (CPP ’21), January 18– enumeration to efficiently handle common operations on 19, 2021, Virtual, Denmark. ACM, New York, NY, USA, 15 pages. network addresses. They introduced such a structure, called https://doi.org/10.1145/3437992.3439913 cubes, and the Network Optimized Datalog engine (NoD), that scales to models of networks of industrial size. Although 1 Introduction a step in the right direction, NoD imposes constraints on programs to scale. Efficient NoD programs (e.g. [24]) are Datalog is a simple and declarative language tuned to data- usually long and complex. The transformation process to centric applications. As a first approximation, it is a fragment generate those programs from the generic ones is manual, of Prolog without function symbols. A Datalog program error prone, undocumented, difficult to understand and trust consists of facts, i.e. positive ground atoms, and implicitly by a third-party, and ultimately difficult to maintain when universally quantified rules which allow the derivation of the initial program evolves.

Publication rights licensed to ACM. ACM acknowledges that this contribu- To address these performance and trust issues, we de- tion was authored or co-authored by an employee, contractor or affiliate of velop two Datalog-level program transformations aimed at a national government. As such, the Government retains a nonexclusive, speeding-up execution by the NoD engine, as well as a static royalty-free right to publish or reproduce this article, or to allow others to analysis upon which the first transformation relies. We use do so, for Government purposes only. CPP ’21, January 18–19, 2021, Virtual, Denmark and extend the Coq/MathComp formalization of Datalog © 2021 Copyright held by the owner/author(s). Publication rights licensed developed in [5, 14], to show that the static analysis captures to ACM. an overapproximation of the behavior of a Datalog program, ACM ISBN 978-1-4503-8299-1/21/01...$15.00 and that the two rewritings preserve the semantics of the https://doi.org/10.1145/3437992.3439913 CPP ’21, January 18–19, 2021, Virtual, Denmark Pierre-Léo Bégay, Pierre Crégut, and Jean-François Monin transformed program. These proofs lead us to design a new Another restriction, called safety, requires all variables in Datalog trace semantics, whose implementation is also ver- the head of a rule to appear in its body, ensuring that only a ified in Coq. The resulting code, which is made available finite number of new facts can be deduced. at [1], can be seen as a first and realistic experiment of the Term body occurrences, or 푡표푐푐 are 3-tuples in N3. The aforementioned Coq implementation of Datalog. components are the indexes of, respectively, the rule, the Section2 recalls the basics of Datalog and discusses the atom (within the body of the rule) and the argument (within limitations of the NoD engine. Sections3 and4 present and the atom), starting at 0. In Example 2.1, the 푡표푐푐 for the 푍 justify our program transformations and the static analysis. within 푒푑푔푒 in the second rule would be ⟨1, 1, 0⟩. Section5 introduces and discusses their Coq formalizations and justifications. We sketch out in Section6 a more efficient 2.2 Semantics version of the static analysis, as well as the main difficulty it raises. Section7 explains the effects of those optimizations, B(푃) is the Herbrand Base of program 푃, i.e. the set of ground in particular in our use case. We finally discuss related works atoms built from its predicates and constants. An interpre- in Section8 and conclude in9. tation 퐼 is a subset of B(푃). A substitution 휈, i.e. a mapping from variables appearing in the program to (program) con- 2 Datalog stants, can naturally be lifted to clauses. We denote the set of substitutions as Σ. We first present the formal syntax and semantics of Datalog. A clause 퐶 is satisfied by an interpretation 퐼 if, for any We then introduce the Network Optimized Datalog engine, substitution 휈, 푏표푑푦(휈 (퐶)) ⊆ 퐼 ⇒ ℎ푒푎푑(휈 (퐶)) ∈ 퐼. Lift- and in particular discuss a caveat with the implementation ing this notion to full programs, 퐼 is a model of 푃 iff all its of certain predicates. clauses are satisfied by 퐼. The semantics of 푃 is its unique minimal model w.r.t. set inclusion [2], written MP. However, 2.1 Syntax this model-theoretic semantics provides no clue on how to actually build MP. To do so, the following functions are used: Assuming sets V, and P of variables, constants and pred- icate symbols, programs are built using the following rules: Definition 2.2. 
(Substitution / clause matching) A sub- 휈 퐶 퐼 푃푟표푔푟푎푚푠 푃 ::= 퐶 , ··· ,퐶 stitution matches a clause w.r.t. an interpretation , writ- 0 푘 ten 푚푎푡푐ℎ(휈,퐶, 퐼), iff 휈 maps all atoms from the body of 퐶 to 퐶푙푎푢푠푒푠 퐶 ::= 퐴 :- 퐴 , ··· , 퐴 . 0 1 푚 elements of 퐼, i.e. 푏표푑푦(휈 (퐶)) ⊆ 퐼. 퐴푡표푚푠 퐴 ::= 푝(푡0, ··· , 푡푛−1) 푇푒푟푚푠 푡 푥 푐 ::= ∈ V | ∈ C Definition 2.3.( 푇푃 – Consequence operator) Let 푃 be a 퐼 푇 퐴 is the head of the clause, whereas the 퐴 ...퐴 sequence program and an interpretation. The 푃 operator adds the 0 1 푚 퐼 is its body. In the third rule, 푝 ∈ P and 푛 is the arity of 푝, set of program consequences to : 푇 퐼 ℎ푒푎푑 휈 퐶 휈 퐶 푃 푚푎푡푐ℎ 휈,퐶, 퐼 퐼 written 푎푟 (푝). We will write as ℎ푒푎푑 : 푐푙푎푢푠푒 → 푎푡표푚 and 푃 ( ) = { ( ( )) | ∈ Σ ∧ ∈ ∧ ( )} ∪ 푎푡표푚 푏표푑푦 : 푐푙푎푢푠푒 → 2 the functions that return the head Definition 2.4. (Fixpoint evaluation) The iterations of and body of a clause. the 푇푃 operator on ∅ are: Example 2.1. Figure1 shows a Datalog program fragment, ( 푇푃 ↑ 0 = ∅ which computes connectivity in a graph (the 푙푖푛푘푒푑 predi- 푇 ↑ 푛 + 1 = 푇 (푇 ↑ 푛) cate) as the of the 푒푑푔푒 relation. 푃 푃 푃 Since 푇푃 is monotonic and bound by B(푃), the Knaster- 푙푖푛푘푒푑(푋, 푌) :- 푒푑푔푒(푋, 푌). Tarski theorem [34] ensures the existence of a least fixed 휔,푇 휔 Ð 푇 푛 푙푖푛푘푒푑(푋, 푌) :- 푙푖푛푘푒푑(푋, 푍), 푒푑푔푒(푍, 푌). point, i.e. ∃ 푃 ↑ = 푛≥0 푃 ↑ . The fixpoint evalu- ation of 푃 is defined as 푇푃 ↑ 휔 = lfp (푇푃 ). It is shown in [36] that lfp (푇푃 ) = MP, ensuring that 푇푃 is an adequate Figure 1. Graph connectivity in Datalog mechanization of Datalog model-theoretic semantics. Example 2.5. Let 푃 be the graph connectivity program from A program contains a mix of ground bodyless clauses, Example 2.1 augmented with the set of facts 퐹 = {푒푑푔푒(1, 3), called facts, and rules, i.e. clauses with at least one atom in 푒푑푔푒(2, 1), 푒푑푔푒(4, 2), 푒푑푔푒(2, 4)}, then their body. A predicate is defined only by rules (intensional 푇푃 ↑ 0 = ∅; 푇푃 ↑ 1 = 퐹 predicates) or by facts (extensional predicates). In Example 푇푃 ↑ 2 = 푇푃 ↑ 1 ∪ {푙푖푛푘푒푑(1, 3), 푙푖푛푘푒푑(2, 1), 2.1, 푙푖푛푘푒푑 is intensional, whereas 푒푑푔푒 (not shown) is con- hspace 푙푖푛푘푒푑(4, 2), 푙푖푛푘푒푑(2, 4)}; sidered as extensional, the facts would then represent the 푇푃 ↑ 3 = 푇푃 ↑ 2 ∪ {푙푖푛푘푒푑(2, 3), 푙푖푛푘푒푑(4, 1), set of edges in the graph. hspace 푙푖푛푘푒푑(4, 4), 푙푖푛푘푒푑(2, 2)}; Developing and Certifying Datalog Optimizations in Coq/MathComp CPP ’21, January 18–19, 2021, Virtual, Denmark

푇푃 ↑ 4 = 푇푃 ↑ 3 ∪ {푙푖푛푘푒푑(4, 3)} = 푇푃 ↑ 5; 2.4 Handling More Genericity The minimal model of 푃 is MP = 푙 푓 푝 (푇푃 ) = 푇푃 ↑ 4. We developed a tool which aims at providing a higher-level The previous definitions form the usual definition ofDat- network veriication model, using NoD as a backend. Roughly, alog. However, the Coq modelization of [5, 14] separates the the tool is used to define generic properties and behaviors rules from the facts, i.e. clauses with an empty body. In the as Datalog rules, where the configuration is abstract. Then, rest of this paper, we also adopt this setting, which matches the specificities of the analyzed network (routing tables, fire- the philosophy of our trace semantics and static analysis. walls, etc) are collected by the tool and used as the EDB. Concretely, a "program" will refer to a set of rules, such Example 2.7. The building block of the specification and as Figure1. A program will be evaluated w.r.t. an initial verification of network behavior is route selection. Figure2 interpretation, called Extensional (EDB). This set shows an abstract definition of routing, where predicate linked(X, Y, P) P X of initial facts will appear explicitely in the iterations of 푇푃 , states that a packet can flow from to Y route(X,Y,F,M) now written (푇푃 ↑ 푛)(퐼), where 퐼 is the EDB. The base case , and is an extensional predicate describ- X Y F is changed to be the EDB, i.e. (푇푃 ↑ 0)(퐼) = 퐼, whereas the ing a route from to for packets matching (captured recursive case is unchanged. This alternative presentation by match_route(P,F)) with priority M. Packets, filters and is equivalent to the traditional one, and simply circumvents priority can be arbitrarily complex. the first iteration of 푇푃 . linked(X, Y, P) :- Finally, Datalog engines may introduce primitive predi- route(X, Y, F, M), match_route(P, F), cates, i.e. predicates which are not defined via facts or rules, not exists_better_route(X, Y, P, M). but computed on the fly. A classical example of primitive predicate is equality, which may often be useful in practice, exists_better_route(X, Y, P, M) :- and much easier to dynamically check rather than define as route(X, Y, F2, M2), match_route(P, F2), the set of all pairs ⟨푥, 푥⟩. M2 > M.

2.3 NoD Engine and Differences of Cubes Figure 2. Simplified network reachability in Datalog Network Optimized Datalog [25] is a Datalog engine used to check reachability properties of dynamic networks. It is based on a Z3 [28] implementation of Datalog, called 휇푍 [17], In comparison, NoD has a clause for each routing rule, with network-oriented modifications made for scalability. which explicitely contains the of the matching of One of the key modifications is the use of a data structure higher-priority rules. Although highly beneficial, this overall called difference of cubes for the representation of network lift in abstraction and clarity on top of NoD comes at a cost. packets and their rewriting. As illustrated by Example 2.7, it requires the use of primitives, here <, with multiple variables. However, the complexity Definition 2.6. (Difference of Cubes) A DoC is a set of with which DoC handles some primitives, including all com- patterns modulo exceptions. Concretely, they are of the form parisons, changes dramatically with the number of involved variables. Ø 휈 Ø휈 ( 푖 \ 푖,푗 ) Example 2.8. Figure3 illustrates how the DoC representa- 푖 푗 tion of the 푣1 ≥ 푣2 relation is exponential in the number of where 휈푖 and 휈 푗 are bit vectors patterns with some bits left bits the integers are coded on – here, four. unspecified. In the setting of NoD, it can be understood as,for example, "every packet of the form 휈 except those matching 1 ∗ ∗ ∗∗ ∗ ∗ ∗∗ \{ ∗ ∗∗ ∗ ∗∗, ∗ ∗ ∗ ∗, ∗ ∗ ∗ ∗, 휈 or 휈 , as well as the packets of the form 휈 with the 0 1 00 01 10 11 1,1 1,2 2 |{z} |{z} |{z} |{z} |{z} |{z} |{z} |{z} exception of those matching 휈2,1". 푣1 푣2 푣1 푣2 푣1 푣2 푣1 푣2 000∗ 001∗, 010∗ 011∗, 100∗ 101∗, 110∗ 111∗,...} However, in order to scale, NoD programs have to be tai- |{z} |{z} |{z} |{z} |{z} |{z} |{z} |{z} 푣1 푣2 푣1 푣2 푣1 푣2 푣1 푣2 lored to reduce the number of interactions between variables from different origin. For example, the contents of the rout- Figure 3. (Incomplete) DoC encoding of the 푣 ≥ 푣 relation ing tables are inlined in the program rather than kept in a 1 2 separate base of fact refered by a small generic program. Us- ing NoD at a real-world, industrial level then requires a lot of Example 2.9. Figure4 illustrates how the comparison be- expertise on NoD itself. Such specialized programs are more tween a variable and a constant, here 1101, is linear in the complicated to write, understand and above all maintain. number of bits the integers are coded on. CPP ’21, January 18–19, 2021, Virtual, Denmark Pierre-Léo Bégay, Pierre Crégut, and Jean-François Monin

values of 푋 in practice are a subset of {1, 2, 3}. With that in ∗ ∗ ∗ ∗ \ {0 ∗ ∗∗, 10 ∗ ∗, 1100} 1 mind, we can rewrite the program with the first rule replaced by the three in Figure6. Figure 4. DoC encoding of 푣 ≥ 1101

푠(1, 푌1) :- 푞(1), 푝(1, 푌1). 푠(2, 푌 ) :- 푞(2), 푝(2, 푌 ). This observation on the impact on the number of variables 1 1 푠(3, 푌1) :- 푞(3), 푝(3, 푌1). in some primitives, combined with the fact that, in practice, the number of relevant integer values (ip addresses, priorities, etc) in networks is much, much lower than allowed by the Figure 6. Defining s(X) with fewer variables types, led us to the optimizations which are presented and discussed in the rest of this paper. The program indeed has the same semantics, as any com- putation of the original program is captured by one of the 3 Partial Rule Instantiation three rules. Conversely, a rule that does not match any con- This section introduces a first program transformation that crete computation can not be used, because at least one atom aims at the reduction of the number of variables. It consists in its instantiated body will not be deduced and available. of a rewriting that replaces a clause with many variables by Given a way to approximate the values taken by any an equivalent set of specialized versions, which all contain given variable (or set of variables) of a program, this gen- strictly less variables, and a static analysis that provides eral method can be used to reduce the number of variables values for the specialization. We first introduce the intuitions in primitive predicate instances, improving performance in behind the two operations, then formalize the rewriting. The some settings, such as NoD. Before formalizing this transfor- formalization of the static analysis is relegated to Section5. mation, we introduce a static analysis that computes such value sets. 3.1 Intuition

In Definition 2.3, the 푇푃 operator tries out every possible 3.2 A Static Analysis for Datalog substitution and, when a match occurs, adds the correspond- Without loss of generality, we only consider programs where ing fact to the returned set. However, in practice, Datalog variable names are not shared across rules. We also assume engines may try to be more efficient than that. In particular, the absence of constants in the head of rules, which can be Datalogcert [6] builds the minimal set of relevant substitu- enforced by introducing new, unary predicates and corre- tions (this point is discussed in Section 5.3). sponding facts (see the proof of Theorem 12.5.2 in [2]). We We observe that an engine which overapproximates the write 푝.푖 the 푖푡ℎ index (starting at 0) of predicate 푝. set of candidate substitutions, while still checking the ground To convey the idea of our static analysis, we introduce an atoms in the tail of the instantiated rule against the given 푇 interpretation, does not change the deduced set of facts. This informal and threefold specification of 푃 , illustrated using means that, if we obtain a projection of such an overapprox- the program of Figure5: (1) Given any EDB, the set of values 푞(푋 ) imation on some variables, we can use it to constrain the with which 1 can be instantiated is a subset of the union of the values returned by the second and third rules. (2) Look- tested values. Our first transformation does so by rewriting 푋 the program, which avoids the complex and specific opera- ing at the former, the set of values for 2 is a subset of the ⟨1, 0, 0⟩ tion of changing the internals of the used Datalog engine. conjunction of those of any of its instances, i.e. and ⟨1, 0, 1⟩, which correspond to 푟.0 and 푟.1. (3) Given a value for 푋2, ⟨푋2, 푋2⟩ must form a valid tuple of arguments for 푟, 푠(푋1, 푌1) :- 푞(푋1), 푝(푋1, 푌1). 푞(푋2) :- 푟 (푋2, 푋2). i.e. values used to instantiate an atom must be compatible. 푞(푋3) :- 푓1 (푌3, 푌3, 푍3), 푓2 (푋3, 푌3, 푍3). Unlike (3), (1) and (2), which are akin to an alternation 푟 (푋4, 푌4) :- 푓1 (푋4, 푌4, 푋4), 푓2 (푍4, 푋4, 푍4) between disjunctions and conjunctions, deal with variables and 푡표푐푐s individually. They then involve much less compu- 푇 Figure 5. Defining s(X,Y) tation than the entirety of 푃 , while hopefully still providing interesting constraints. Example 3.1. Let us consider the program in Figure5, where We propose an analysis that performs (1) and (2) by build- 푠, 푞 and 푟 are intensional predicates, 푓1 and 푓2 extensional, ing, for any variable to instantiate, a tree with nodes labeled and 푝 primitive, as well as the EDB {푓2 (1, 2, 5), 푓2 (3, 7, 1), with ∧ and ∨ and leaves representing columns in the EDB 푓1 (2, 2, 2)}. Without running the program or knowing the tables, e.g. "the third argument of predicate 푝". The tree then behavior of 푝, one may quickly convince herself that the represents the way values flow from the EDB to a variable Developing and Certifying Datalog Optimizations in Coq/MathComp CPP ’21, January 18–19, 2021, Virtual, Denmark during the program’s execution. The branches are annotated Example 3.3. Figure 9a shows the analysis of variable 푋1 in with the index of the corresponding clause (descendants of ∨- the program of Figure8, where 푞 is an extensional predicate. nodes) or atom (descendants of ∧-nodes). We will discuss in Section6 how to use those annotations to simulate a weaker 푝(푋1, 푌1) : − 푝(푌1, 푋1) . version of (3) on top of the actual analysis. 푝(푋2, 푌2) : − 푞(푋2, 푌2) . Example 3.2. 
The flow of 푋1 in the program of Figure5 is shown in Figure7. The variable has two occurrences in the Figure 8. Program 푃 body of the first rule, in the first and second atom. However, 푟푒푐 the second atom is a primitive, which is ignored by the static analysis. The root of of the tree then has only one children, annotated with the index of the atom, 0. ∧ ∧ 푞 0 0 The corresponding atom is an occurrence of , there are ∨ ∨ then two descendants, the 1 and 2 annotations being the 0 1 0 1 indexes of the two clauses defining 푞. The use of a ∨-node ∧ ∧ ∧ ∧ reflects the fact that the facts about 푞 is the union of those 0 0 0 0 defined by these two rules (see (1) above). ∨ 푞.1 ∨ 푞.0 1 1 The right descendant leads to the 푋3 variable, whose only ∧ ∧ appearance in the body of the rule is as the first argument 0 0 푞.0 푞.1 of an occurrence of the extensional 푓2 predicate. The next descendent then is the corresponding leaf. (a) Analysis of 푋1 (b) Analysis of 푌1

On the other hand, the left descendant leads to 푋2, which has two occurrences in the same (and only) atom of the rule’s Figure 9. Analysis of a recursive program body. There are then two new descendants, both annotated with 0, the index of the 푟 atom. Here, the use of a ∧-node is justified by (2). The rest of the tree follows the same logic. The analysis starts with the 푝 atom in the body of the first clause. It then considers two different ways to deduce a 푝 ∧ fact, i.e. the two clauses. When looking at the first clause, we 푌 0 analyze the variable matching the previous position, i.e. 1. 1 ∨ 2 It occurs in the same atom, which can again be instantiated ∧ ∧ using both clauses. However, this time we have only one child 0 0 1 under the ∨-node, as looking at the first clause again would 푡표푐푐 ∨ ∨ 푓2.0 bring us back to our original . The analysis eventually 3 3 captures the fact that values of 푝 can be permuted, and that ∧ ∧ the set of values of 푋1 is a subset those of 푞.0 and 푞.1. 0 1 0 0 푓1.0 푓1.2 푓2.1 푓1.1 From such a tree, we can extract a set of values for the analyzed variable. Each 푓 .푖 leaf returns the set of constants appearing at the 푖푡ℎ position of a 푓 fact in the EDB, whereas Figure 7. Analysis of a variable the ∨ and ∧ nodes are treated as ∪ and ∩, respectively.

This principle will obviously not fare well with , 3.3 Clause Duplication Formalization which happens to be a core feature of Datalog. The analy- sis then stores all previously visited 푡표푐푐s when recursively We now fully define the transformation that was informally 푃 calling itself, to avoid having a 푡표푐푐 twice in a branch of the introduced in Section 3.1. It first assumes a program and 퐼 returned tree. This bounds the derivations despite poten- an interpretation . As one may want to only instantiate a tially recursive programs. The reasoning is that, to find an selection of variables, it is parameterized by such a subset 푅 푆 approximation of the values going through some 푡표푐푐, one . It also requires a set of substitutions that captures the 푅 should not look at the recursive part of the corresponding behavior of the program projected on the variables in . predicate, but rather the other predicates that "ground" it. To formalize this notion, we first introduce 휈 |푅, the restric- This idea is formalized is 5.2, but let us provide an intuition. tion of 휈 to the variables in 푅, as well as 푣푎푟푠, the function that computes the set of variables in a clause. A first clarifi- cation of the completeness condition of 푆 is that, whenever CPP ’21, January 18–19, 2021, Virtual, Denmark Pierre-Léo Bégay, Pierre Crégut, and Jean-François Monin

휈 퐶 휈 푓 a substitution matches a clause at some point of the pro- instantiated clause matches |푅 to produce . This is a corol- gram’s execution, then the projection of 휈 over the relevant lary of a more general lemma, stating that whenever the variables is in 푆. variables are split into two sets 푋1 and 푋2, and a substitution ′ ′ 휎 matches a clause 퐶 , then 휎| matches 휎| (퐶 ). □ By relevant variables, we mean the intersection of the 푋2 푋1 퐶 variables in the clause and the set of variables to instantiate Theorem 3.5. (Transformation soundness) For any num- 푅. Moreover, due to the monotonicity of the matching w.r.t. ber of steps 푘, (푇푖푛푠푡 (푃) ↑ 푘)(퐼) ⊆ (푇푃 ↑ 푘)(퐼). the provided interpretation, a substitution matches a clause at some point of the execution if and only if it matches it w.r.t. Proof. This lemma works in a both similar and dual way, as 푇 휔 퐼 the semantics of the program, i.e. ( 푃 ↑ )( ). The resulting we get two substitutions (one for the clause instantiation, one formula is as follows: matching the transformed clause) that need to be combined to retrieve the substitution used in the original program. ∀퐶 ∈ 푃, 푣푎푟푠(퐶) ∩ 푅 ≠ ∅ □ 휈,푚푎푡푐ℎ 휈,퐶, 푇 휔 퐼 휈 푆 . ⇒ (∀ ( ( 푃 ↑ )( )) ⇒ |푅 ∈ ) The reader may notice a discrepancy between the analysis and the rewriting, as the former provides values for only It might seem paradoxical to mention the full semantics one variable, and the latter assumes a set of substitutions, of a program to introduce an optimization. We then want to i.e. potentially instantiates multiple variables at once. We clarify this point, by underlying that the completeness condi- wanted these two components to be kept separate, so that tion mentioned above has to be checked once for any static the rewriting can be used with another analysis. In particular, analysis which computes a set of candidate substitutions, one may come up with a static analysis that returns a set such as seen in 3.2, and plays no role in practice. of values for multiple variables at once, which could then Using function 푑표푚, that returns the domain of a substitu- be directly used with the rewriting, encapsuling this set of tion, any clause can be instantiated using the 푖푛푠푡 function, values into substitutions. Such an analysis is sketched and which can naturally be lifted to programs. discussed in Section6. In practice, after using the analysis on multiple variables of the same rule, one can either apply the rewriting multiple {휈 (퐶) | 휈 ∈ 푆 ∧ 푑표푚(휈) = 푅 ∩ 푣푎푟푠(퐶)}  times with a substitution on a single variable, or generate 푖푛푠푡 (퐶)  if 푣푎푟푠(퐶) ∩ 푅 ≠ ∅ = substitutions using a cross product of the different value sets.  {퐶} otherwise  4 Predicate Specialization Note that, in these definitions, the 푣푎푟푠(퐶) ∩ 푅 ≠ ∅ con- dition can be dropped if the empty substitution is manually After the number of variables, our second optimization re- added to 푆. duces the number of arguments of some predicates, by in-

We can now use the 푇푃 operator to state and prove that troducing new symbols to partition existing relations into the two programs deduce the same facts in the same number smaller ones. Given an intensional predicate such that one of steps. The semantic adequacy trivially follows. of its arguments is always a constant in the rules defining it, then the predicate can indeed be replaced by a set of Theorem 3.4. (Transformation completeness) For any specialized versions. number of steps 푘, (푇푃 ↑ 푘)(퐼) ⊆ (푇푖푛푠푡 (푃) ↑ 푘)(퐼).

푆 (1, 푌, 푍) :- 푄 (푌, 푍). | 푆1 (푌, 푍) :- 푄 (푌, 푍). Proof. We proceed by induction on 푘. In the base case, we 푆 (2, 푌, 푍) :- 푅(푍, 푍, 푌 ). | 푆2 (푌, 푍) :- 푅(푍, 푍, 푌 ). have (푇푃 ↑ 0)(퐼) = (푇푖푛푠푡 (푃) ↑ 0)(퐼) = 퐼. 푇 (푋) :- 푆 (1, 푋, 푋). | 푇 (푋) :- 푆1 (푋, 푋). 푈 (푋) :- 푆 (푋, 푋, 푋). | 푈 (푋) :- 푆 (푋, 푋, 푋) In the recursive case, let 푓 ∈ (푇푃 ↑ 푘 + 1)(퐼). We can extract a clause 퐶 from 푃 and a substitution 휈 such that 휈 matches 퐶 w.r.t. (푇푃 ↑ 푘)(퐼) and 푓 is the head of 휈 (퐶). If 퐶 Figure 10. Normal and specialized program has no relevant variable, i.e. |푣푎푟푠(퐶) ∩ 푅| = 0, we reuse the same clause and substitution. Since the atoms in the body Example 4.1. The first two rules on the left of Figure 10 of the instantiated clause are in (푇 ↑ 푘)(퐼) (induction 푖푛푠푡 (푃) define a predicate 푆 of arity 3, and the third and fourth rules hypothesis), meaning that we can indeed deduce 푓 in 푖푛푠푡 (푃). use it. We introduce the 푆1 and 푆2 predicates, of arity 2. The Otherwise, we use the completeness hypothesis on 푆 to rules can then be replaced by those on the right. To allow the 푖푛푠푡 푃 휈 퐶 휈 푆 show that ( ) contains |푅 ( ). We write as |푅 the re- use of rules still containing atoms in their body, such as the striction of 휈 to the variables not in 푅, we also show that, fourth one, we need to add the rules of figure 11 to the new using (푇푖푛푠푡 (푃) ↑ 푘)(퐼) as an interpretation, this partially program. On the other hand, the addition of the reverse rules Developing and Certifying Datalog Optimizations in Coq/MathComp CPP ’21, January 18–19, 2021, Virtual, Denmark

(normal to specialized version, e.g. 푆1 (푌, 푍) ← 푆 (1, 푌, 푍)) is about the full deduction of a fact as a whole and a priori, in not required. the statement of a core intermediate lemma (see 5.2). We then need to introduce a Datalog trace semantics, formalized 푃 푆 (1, 푌, 푍) :- 푆1 (푌, 푍). as a set of trees and written B푇 ( ). The leaves are facts 푆 (2, 푌, 푍) :- 푆2 (푌, 푍). taken from the initial interpretation, while internal nodes represent a deduction via a rule, a substitution (both stored in the node as a couple) and a previously deduced set of facts Figure 11. Relating normal and specialized definitions (the descendants in the tree). We first need a function that maps a trace to the corresponding deduced fact: Assuming a Datalog program 푃 where the 푖푡ℎ argument Definition 5.1. (Erasing function from trees to facts) 푆 푃 ′ of a predicate is always a constant, and writing the 푓 if 푥 = 퐿푒푎푓 (푓 ) program obtained via the specialization of 푆 at index 푖, we 푑푒푑(푥) = 휈 (ℎ푒푎푑(퐶)) 푥 = 푁표푑푒(⟨퐶, 휈⟩,푑푒푠푐푒푛푑푎푛푡푠) proved the following results: if Theorem 4.2. (Predicate specialization completeness) We adapt the base semantics to its trace version. First, the For any number of steps 푘, (푇푃 ↑ 푘)(퐼) ⊆ (푇푃′ ↑ 2푘)(퐼). interpretation 퐼 must be a set of trees rather than facts: Definition 5.2.( 푡푏 – Interpretation to trees) The number of steps used in the transformed program is doubled, because of the use of rules such as those shown in 푡푏(퐼) = {퐿푒푎푓 (푓 ) | 푓 ∈ 퐼 } Figure 11. The transformed program produces specialized facts that did not appear in the original program. This is not In the spirit of Section 2.2, the trace semantics is defined 푇푡 푇 a concern for the completeness theorem, as the new facts via an operator called 푃 . Like 푃 , it is iterated to build are on the right side (both literally and figuratively), but the new traces on top of the previously deduced ones. When soundness must be formulated modulo those new facts: deducing a new fact with a clause 퐶 and a substitution 휈, the previous iteration must contain traces for the body of 휈 (퐶): Theorem 4.3. (Predicate specialization soundness) For 푇푡 any number of steps 푘, Definition 5.3.( 푃 – Consequence operator on traces) {푥 ∈ (푇 ′ ↑ 푘)(퐼) | 푥 } ⊆ (푇 ↑ 푘)(퐼) 푃 is not specialized 푃 . 푇푡푃 (퐼푡 ) = 퐼푡 ∪ {푁표푑푒(⟨퐶, 휈⟩, [퐹1, ..., 퐹푛]) ∈ B푇 (푃) | 퐶 = 퐴 :- 퐴 , ··· , 퐴 ∈ 푃 One might expect that the number of steps used in the 0 1 푛 푖 ..푛 , 퐹 퐼 푑푒푑 퐹 휈 퐴 transformed program might again be doubled compared to ∧ ∀ ∈ [1 ] 푖 ∈ 푡 ∧ ( 푖 ) = ( 푖 )} the original one, but that would not account for the normal, not specialized part of the program. Example 5.4. Let us consider again the program of Exam- ple 2.1, with the first and second clauses denoted 퐶0 and 퐶1. The trace semantics contains the four traces shown in Fig- 5 Coq Formalization ure 12, which gradually lead to the deduction of 푙푖푛푘푒푑(4, 3). 푡 푡 푑푒푑 푡 푒푑푔푒 , 푑푒푑 푡 The Coq certification of the static analysis relies onanew Writing 0 to 3 the trees, ( 0) = (4 2), ( 1) = 푙푖푛푘푒푑 , 푑푒푑 푡 푙푖푛푘푒푑 , 푑푒푑 푡 푙푖푛푘푒푑 , Datalog trace semantics we first introduce. We then outline (4 2), ( 2) = (4 1), ( 3) = (4 3). the formalization and verification of the analysis, as well as the two program transformations. Since this is our own semantics, we need to prove its adequacy with the usual one. Given a program 푃 with inter- 1 The Coq definitions and statements are slightly altered . 
pretation 퐼, this result is expressed in the following lemmas, Also for space constraints and readability, the hypotheses which are both proved by induction on the left deduction: are removed from the individual lemmas and theorems, but described in Section 5.5. Finally, we discuss some choices Lemma 5.5. (Datalog trace semantics completeness) 푘, and errors made in the course of this work. For any number of steps ∀푥 ∈ (푇푃 ↑ 푘)(퐼), ∃푡 ∈ (푇푡푃 ↑ 푘)(푡푏(퐼)), 푑푒푑(푡) = 푥 5.1 Datalog Trace Semantics Lemma 5.6. (Datalog trace semantics soundness) For any number of steps 푘, A trace semantics not only assigns a meaning to a program, ∀푡 ∈ (푇푡푃 ↑ 푘)(푡푏(퐼)), 푑푒푑(푡) ∈ (푇푃 ↑ 푘)(퐼). but also keeps track of the computations leading up to it [12]. Datalogcert [6] already formalizes the fixpoint semantics of Unlike B(푃), B푇 (푃) is not finite by default. In our Coq Datalog (see 2.2), but the proof of our static analysis reasons formalization, B푇 (푃) is defined using the finType wutree 1Type names are changed to better match those in the paper, and some [4], i.e. the type of trees with a bounded width (here the max- meaningless arguments, e.g. default values provided for 푛푡ℎ, are removed. imal body length amongst the clauses of a given program) CPP ’21, January 18–19, 2021, Virtual, Denmark Pierre-Léo Bégay, Pierre Crégut, and Jean-François Monin

edge(4,2) ∈ (푇푡 ↑ 0)(푡푏(퐼)) 푃 푋 ↦→ 4 푋 ↦→ 4   ⟨퐶1, 푌 ↦→ 1 ⟩ ∈ (푇푡푃 ↑ 2)(푡푏(퐼)) ⟨퐶1, 푌 ↦→ 3 ⟩ ∈ (푇푡푃 ↑ 3)(푡푏(퐼)) 푍 ↦→ 2 푍 ↦→ 1 푋 ↦→ 4   ⟨퐶0, ⟩ ∈ (푇푡푃 ↑ 1)(푡푏(퐼)) ↦→ 4 푌 ↦→ 2  푋 푋 ↦→ 4  ⟨퐶 , ⟩ edge(2,1) ⟨퐶 , 푌 ↦→ 1 ⟩ edge(1,3) 0 푌 ↦→ 2 1 푍 ↦→ 2  edge(4,2) 푋 ↦→ 4 ⟨퐶 , ⟩ edge(2,1) edge(4,2) 0 푌 ↦→ 2

edge(4,2)

Figure 12. Deduction of 푙푖푛푘푒푑(4, 3) and unicity across paths. The unicity constraint bounds the respectively. Figure 14 shows the Coq definition of the static height of the trees by the cardinal of the node type. analysis (some auxiliary but fundamentally straightforward The substitutions were already defined as a finType6 in[ ], functions, such as computing the set of occurrences of a prev and we changed the clause type so that it is finite as well given variable, are omitted). In this definition, is the 푡표푐푐 v (see Section 5.5). The completeness lemma remains true in set of previously visited s, and is the studied variable. count this setting, as having a repetition in a deduction amounts The role and value of , as well as the coding style are discussed later on. to proving 푥 inside a proof of 푥. The whole deduction can be replaced by the inner one, until a repetition-free tree is Although the general principle of the analysis is rather reached. straightforward, the actual definition is not. Despite its short- ness and heavy use of MathComp’s useful set comprehension 5.2 Justification of the Static Analysis notations, ensuring that it actually computes an overapprox- imation of Datalog semantics does not seem superfluous. As stated in Section 3.2, the trees produced by the analysis As previously stated, the analysis does not consider the are akin to propositional formulae. The Coq formalization same program point twice, allowing a fast and terminating handles them in Disjunctive Normal Form. In this form, the analysis of recursive programs. However, the resulting, trun- relation with the actual computation seems less natural, as cated trees and the actual deduction of facts, i.e. traces, do we lose the alternation between conjunctions and disjunc- not fully overlap one another in this setting (which was the tions, but is surprisingly much easier to prove. case with a previous version of the analysis, see Section6). Building a DNF means that we need to be able to compute To relate them, we introduce an intermediate layer, which every combination of propositional formulae across different we call the no-recursion trace. The idea is to identify a trun- disjunctions – here, sets. Figure 13 shows our generalization cated trace that still preserves enough information to be of the cartesian product between two sets provided by Math- related to the actual trace semantics – a problem already Comp (see setX in finset.v), where function tnth x j returns tackled in other verification contexts [19]. More concretely, the j푡ℎ element of tuple 푥. given a deduction trace and a variable, we extract repetition- free sequences of 푡표푐푐s the variable goes through. Definition gen_cart_prod {A : finType}( ss : seq {set A}) Example 5.7. We reuse the program from Example 2.1 and :{set( size ss).-tupleA } := the rightmost trace of Figure 12, which represents a deduc- 푙푖푛푘푒푑 , let m := size ss in tion of (4 3). Let us compute the no-recursion trace 푋 푋 [set x : m.-tupleA | of in the second rule. has only one occurrence in the 푙푖푛푘푒푑 [forall j : 'I_m, (* seq -> tuple *) body of the rule, at index 0 of the atom. We then look tnthxj\ in tnth (in_tuple ss) j]]. at the child corresponding to the atom – the left child – of the actual trace, which contains the same clause. The cor- responding term, i.e. the one at index 0 of the head of the Figure 13. Generalized cardinal product clause, is again 푋, which only occurs in the 푙푖푛푘푒푑 predicate. 
Since this 푡표푐푐, ⟨1, 0, 0⟩, has already been visited, it is not The DNF is encoded as a set of sets of sequences. The outer added again to the sequence. In a sense, we ignore this step and inner sets represent the disjunctive and conjunctive parts, and keep exploring the trace, which leads us to the first clause. Developing and Certifying Datalog Optimizations in Coq/MathComp CPP ’21, January 18–19, 2021, Virtual, Denmark

(* flattening a cartesian product *) Definition bigcup_cart {m}{ A : finType}( s :{set m.-tuple {set A}}):{set{set A}} := [set \bigcup_(x <- tvaly ) x | y : m.-tuple {set A} in s].

Fixpoint analyze_var_prev (prev :{set occ})( v : var)( count : nat):{set{set( uniq_seq occ)}} := match count with|0 => set0 | count.+1 => (* occurrences of v not yet visited *) let occs := occsInProgrampv :\: prev in let analyze_pi (prev :{set occ})( o : occ) := match p_ato with (* predicate of the atom at the occurrence *) | None => set0 | Somef => match predtypef with | Edb =>[set[set unil]] (* get_cl_var cl i returns the variable at the i-th position of cl's head *) (* t_ind returns the third item of an occurrence, i.e. t_ind = z *) | Idb => let arec :=[set( analyze_var_prev prev (get_cl_var cl (t_indo ))) count | cl in p & head_predicate cl == f] in \bigcup_(x in arec) x end end in

(* adding occurrence o on top of every branch in dt *) let all_add_o (dt :{set{set( uniq_seq occ)}})( o : occ):{set{set( uniq_seq occ)}}:= [set[set o::br | br in ct]| ct in dt] in

let arec :=[ seq all_add_o (analyze_pi (occ |: prev) occ) occ | occ <- enum occs] in bigcup_cart (gen_cart_prod arec) end.

(* Version used in practice, with prev set to empty set for maximal precision *) Definition analyze_var (v : var)( count : nat):{set{set( uniq_seq occ)}} := analyze_var_prev set0v count .

Figure 14. The static analysis in Coq

We add the 푡표푐푐 of 푋 in the 푒푑푔푒 predicate, i.e. ⟨0, 0, 0⟩. The Theorem no_rec_needed trv (i : interp) cl next child in the trace is a leaf (the 푒푑푔푒 predicate is exten- (m : nat) s : sional), so we stop here and return the set only containing tr\ in sem_tpmi the sequence [⟨1, 0, 0⟩, ⟨0, 0, 0⟩]. This sequence indeed is a -> root tr = inl (RS cls ) truncated, repetition-free version of the path taken by values -> v\ in tail_vars (body_cl cl) from the EDB to the variable w.r.t. the actual trace. ->[forall br in norec prev trv (height tr).+1, ...] The Coq definition of the no-recursion trace also hasa 푐표푢푛푡 argument to ensure its termination. Its natural value Figure 15. Completeness of no-recursion traces is the successor of the height of the given trace, as seen in the lemma of Figure 15 (we omit the full details of its Coq formalization for space reasons). This lemma states that, interpretation is that, as stated in Section 3.2 and intuited by for any deduction which ultimately uses a clause 푐푙 with a the theorem’s name, dealing with recursion is not necessary substitution 휎, and for any variable 푣 that appears in 푐푙, the to get a first approximation of the semantics of a program. sequences extracted from the trace all lead to a (potentially −→ Now we need to relate the no-recursion trace with the different) ⟨푓 , 푖⟩ such that there exists a fact 푓 ( 푐 ) in the EDB actual static analysis. With the no-recursion trace being a set 휎 (푣) = −→푐 .푖 with . of sequences and the analysis encoded as a set of sets of se- In other words, the no-recursion trace is used to com- quences, the core lemma simply states that the no-recursion pute the bounded part of the constraints which the actual trace is actually an element of the analysis, meaning that it semantics enforces. In that sense, it is complete. Another is captured, as shown in Figure 16. CPP ’21, January 18–19, 2021, Virtual, Denmark Pierre-Léo Bégay, Pierre Crégut, and Jean-François Monin

Lemma no_rec_capt prev trim clsv : time with this version (some desired results could be proved, tr\ in sem_tp defmi others could not), we defined a function that checks whether -> root tr = inl (RS cls ) a given substitution matches a clause w.r.t. an interpretation, -> v\ in tail_vars (body_cl cl) and related both definitions. This boolean matching was -> norec prev trv (height tr).+1 also most helpful in the proof of the predicate specialization \in analyze_var_prev prevv (height tr). transformation. The static analysis and clause duplication are completely Figure 16. No-recursion trace capture separated in our Coq development. Given another method to extract a set of substitutions from a Datalog program, The program analysis also uses a fuel argument called one then only needs to show that it has the completeness count. Our first intuition for its concrete value was the car- property expressed in 3.3 to show that the use of the partial dinal of 푡표푐푐, i.e. the number of 푡표푐푐s of the given program. rule instance on top of this method preserves the semantics. Although it probably is an adequate value, a surprising, differ- ent answer emerges from no_rec_capt, the lemma relating 5.4 Predicate Specialization the no-recursion trace and the analysis. Theorems 4.2 and 4.3 are proved by induction on the number In this lemma, the fuel of the analysis is the successor of of steps in the deduction. However, formulated as they are the height of the trace. It is in itself not satisfactory as the above, the provided inductions are not strong enough. We deduction traces can in theory be arbitrarily long but, as write as 푝푟표 푗 the function that takes a fact 퐹 (푐1, ..., 푐푛) and mentioned in 5.1, we implement the traces as trees with a 푆 푐 , ..., 푐 , 푐 , ..., 푐 퐹 returns 푐푖 ( 1 푖−1 푖+1 푛) if is the predicate to be height bounded by the cardinal of the node type, here couples specialized 푆, and behaves as the identity otherwise. The of a clause and a substitution, written rul_gr. Once the inductions are performed on the following lemmas, from monotonicity of the analysis w.r.t. fuel is shown, its normal which theorems 4.2 and 4.3 can easily be deduced: form can be defined using the product of the cardinals ofthe clause and substitution types. The corresponding lemma is Lemma 5.8. (Predicate specialization completeness) For shown in Figure 17, where rul_gr is the node type of the any number of steps 푘, traces. (푇푃 ↑ 푘)(퐼) ∪ {푝푟표 푗 (푥) | 푥 ∈ (푇푃 ↑ 푘)(퐼)} ⊆ (푇푃′ ↑ 2푘)(퐼). Lemma 5.9. (Predicate specialization soundness) For Lemma no_rec_capt prev trim clsv : any number of steps 푘, tr\ in sem_tp defmi (푇 ′ ↑ 푘)(퐼) ⊆ (푇 ↑ 푘)(퐼) ∪ {푝푟표 푗 (푥) ∈ (푇 ↑ 푘)(퐼)}. -> root tr = inl (RS cls ) 푃 푃 푃 -> v\ in tail_vars (body_cl cl) The main difficulty in proving those results was the use -> norec prev trv (height tr).+1 of the rules as in Figure 11. Indeed, the point of [6] is the \in analyze_var_prev prevv #|rul_gr|. verification of a Datalog engine, i.e. ensuring that iterating the 푇푃 operator on any given program will compute the Figure 17. No-recursion trace capture, normal form expected semantic. On the other hand, we have to manually add and use such rules. In [6], the atoms carry a proof that their number of arguments is the arity of the associated Combining this result with Theorem 15, we show that the predicate, and the variables are encoded as ordinals, meaning analysis can be used to extract a set of substitutions that that we need to deal with a lot of dependent types. 
As an overapproximates the semantics of the studied program, in illustration, Figure 18 shows the (somewhat cumbersome) the sense formulated at the beginning of Section 3.3. definition of a sequence of variables used in the generic rules.

5.3 Clause Duplication To use those generic rules, we wrote a function that takes a list of variables and a list of terms, and creates a substi- The Coq proofs of theorems 3.4 and 3.5 are fundamentally tution that maps each variable to the corresponding value. similar to the paper versions above. Our main technical dif- This function uses a deduced specialized fact to build the ficulty for those proofs was the use of substitutions and substitution that matches a generic rule. A shift in the list of matchings. The original development [6] of [5] provides a variables fully changes the extracted substitution, meaning constructive matching function, that returns the set of mini- that lemmas on matching using this function could not be mal substitutions matching a clause w.r.t. an interpretation, proved using straightforward inductions. Another difficulty but surprisingly no boolean one. The "minimality" of the con- to certify the predicate specialization was to find well-suited structive version leads to complex proofs, requiring specific abstractions for the list of variables using a combination of and intricate induction principles. After struggling for a long properties and functions. Developing and Certifying Datalog Optimizations in Coq/MathComp CPP ’21, January 18–19, 2021, Virtual, Denmark

Definition dep_iota (mn : nat) as well, and although it sometimes implied to rely more on : seq ('I_(m+n)) := dependent types to fit our structures into finTypes, webe- pmap insub (iotamn ). lieve the trade-off was beneficial, as it led to much more readable, paper-like definitions and statements. The proof (* [X_1, X_2, ..., X_j] *) effort was also lightened, as these definitions made iteas- Definition gen_vars_j (j : 'I_n.+1): seq term := ier to follow the high-level proof structures, rather than get map lost in the kind technicalities sometimes required to encode (fun x => Varx ) concepts in Coq. (map (fun x : 'I_j => widen_ord (ltn_ordj ) x) Section 6 of [35] states that "generally speaking, there (dep_iota 0 j)). are two ways to specify an in Coq: either as in- ductive predicates using inference rules, or as computable Figure 18. Manually defining a sequence of variables functions defined by recursion and pattern-matching over tree-shaped data structures". Datalog programs however, are Defining and certifying this rewriting felt like working not naturally seen as trees, both from syntax2 and semantics against [6], which resulted in core lemmas that are much standpoints. Our first approach was to develop our static more abstruse than the other results presented in this paper. analysis in the form of a typing system, and formalize it It could however, along with the definitions and tricks it in Coq using four mutually-defined inductives. The result, relies on, be used if other program transformations that add which was supposed to emulate a loop-based implementa- rules are to be introduced and verified in the future. tion of the idea behind the analysis, was not satisfactory, as it lacked clarity, was hard to use in practice (the induction 5.5 Assumptions principle required four manually-defined invariants and pro- duced many proof obligations), and did not allow to reason The original development [6] assumes finTypes for the predi- about the termination of the analysis, which we felt was cate names and constants, an arity function, a number of vari- really missing. As a corollary, there was also something awk- ables, a default constant, and a program with the safety con- ward about defining what is supposed to be a deterministic dition defined in Section 2.1. We assume a program-specific function as an inductive predicate. maximum length for the bodies of the clauses, and default We eventually switched to the set-based definition seen in values (predicate, variable, etc) for the nth function. The Figure 14. Although the definition is not completely straight- definition of wutree requires a default ground atom. forward, the intricacy seems inherent to the analysis rather We added three other hypotheses previously mentioned than a consequence of the formalization itself. to the syntax of Datalog: a predicate appears at the head of The authors of [35] also recall that defining a function, a clause iff it is intensional (cf. 2.1), variables are not shared such as our analysis, in a computational way rather than as across clauses and there is no constant in the heads of clauses an inductive also allows its extraction as an Ocaml program (see Section3 for both). Those last two assumptions could (which we have not experimented with this development be replaced by actual transformations ensuring the desired yet). Alternatively, numerous non-functional programming properties (i.e. adding rule index to variable names for the languages (e.g. Python) now support set notations. 
Another former, and a unitary predicate for each constant appearing advantage of this version of the analysis is then its simpli- in a clause’s head for the latter), but it was not tackled in the fied translation in many languages, which reduces thegap course of this work. between formalization and implementation, and makes the The proofs of the predicate specialization require a few latter more trustworthy. technical hypotheses, such as the number of variables be- Although the use of finTypes and set notations was even- ing larger than the arity of the specialized predicate. Finally, tually most beneficial to us, our proof style remained more the use of Program.Equality to certify the trace seman- classical. This is in contrast to [6], which uses SSReflect ex- tics requires the heterogeneous equality JMeq_eq, and [4] tensively. Combined with the heavy use of dependent types uses Equations [33] to build finTypes, which implies the to obtain finTypes, it resulted in a development that was functional extensionality axiom. The full detailed list of as- probably longer than what could be expected (approx. 6500 sumptions can be found in the README of [1]. loc, vs. 1500 for [6]). 5.6 Discussion and Lessons Learned The conclusion of [5] underlines that the justification of many foundational and "intuitively clear" database results The finite nature of Datalog allows the authors of Datalogcert [6] to make the most of MathComp’s finTypes and set nota- 2Although they implement them as lists in the formalization, Datalog pro- tions throughout their development. We leverage this asset grams are even defined as sets of clauses in[14]. CPP ’21, January 18–19, 2021, Virtual, Denmark Pierre-Léo Bégay, Pierre Crégut, and Jean-François Monin had always been treated with a high-level persective rather Such constructs were present in our use case, sometimes than "scrupulous proofs", meaning that "low-level details with many dependent variables. Figure 20 presents a simpli- were either glanced over or left to the reader". This also fied implementation of the firewall mechanism (the protocol applies to our present work, in particular the static analysis, is not taken into account), where the presence of many primi- where fundamentally simple ideas can be implemented in a tives, including comparisons, requires to partially instantiate rather intricate and error prone way. the rule. Every of the 10 variables that appear are linked firewall_rule A preliminary version of the analysis, as an inductive, by the extensional predicate, meaning that contained a very subtle but major error that made the analy- using a cartesian product will lead to a serious impact on sis return a special, theoretically never used no info result performances. when applied to recursive programs. Since our lemmas were defined as "if the analysis returns an actual result, it iscom- ok_fw_rule(RID, IP1, IP2, PRO, P1, P2) :- firewall_rule(id=RID, plete", it went unnoticed at first. We understood there was a protocol=PRO, problem when reflecting on the proofs, noticing that Data- source_prefix=P1, log recursion was not dealt with. We had not realized that dest_prefix=P2, when actually writing the proofs, because of the very tech- source_mask=M1, nical, sometimes obscure, obligations generated by the four dest_mask=M2, mutually-defined inductives. On the other hand, working src_port_min=MIN1, with the set-based version was much clearer and allowed src_port_max=MAX1, easier high-level reasoning. 
dest_port_min=MIN2, dest_port_max=MAX2), In summary, even in the context of machine-aided veri- P1 = IP1 & M, fication, the mix of an error in a minor definition –which P2 = IP2 & M, would have been spotted with correct lemmas – and lacking P1 >= MIN1, P1 <= MAX1, formulations of completeness properties – which would have P2 >= MIN2, P2 <= MAX2. been benign with correct definitions – could lead to a broken result. Computational definitions, higher-level tools (both provided by MathComp in our case) and a more introspective Figure 20. Simplified firewall in Octant view should help avoid this kind of situation. The simple ideas behind our static analysis are not only Moreover, things can be more obsfuscated than with the error prone at the level of implementation, but sometimes previous examples: Figure 21 illustrates how a dependency also conceptually. The next Section introduces a component can be preserved (left side) or lost (right side) across rules. of the rule instantiation, that seemed like a very natural and efficient addition, and discusses how the verification process 푝(푋, 푌) :- 푞(푋, 푌). | 푝(푋, 푌) :- 푞(푋, 푌). helped us realize that it was not adequate for every program. 푞(푋, 푌) :- 푟 (푋, 푌). | 푞(푋, 푌) :- 푟 (푋, 푍1), 푟 (푍2, 푌). 푟 (푋, 푌) :- 푠(푋, 푌). | 푟 (푋, 푌) :- 푠(푋, 푌) 6 Using Dependencies across Analyses Figure 21. Deep linear and quadratic definitions of 푝 In 3.2, we identified three components of the 푇푃 operator and designed a static analysis that performs the first two. 푝 However, the two definitions of in Figure 19, which behave Our analysis was first designed to deal with this issue, differently, can only be distinguished using the third compo- which is why the branches of the trees are annotated to store nent, that relates the values assigned to different variables. the corresponding atoms or rules. The idea was to overlap the trees resulting from the analyses of multiple variables in a 푝(푋, 푌) :- 푞(푋, 푌). | 푝(푋, 푌) :- 푞(푋, 푍 ),푞(푍 , 푌). 1 2 single clause, using the annotations to exclude incompatible flows, e.g. an atom being instantiated with different rules Figure 19. Linear and quadratic definitions of 푝 defining the corresponding predicate.

Example 6.1. In Example 3.3, variables 푋1 and 푌1 appear in The first definition computes a number of 푝-facts that is the same clause. The roots of their respective analyses (Fig- linear in the number of 푞-facts, while it is quadratic in the ures 9a and 9b) have only one descendant, annotated with 0 second. In other words, when used on multiple variables of a in both cases. This implies that the corresponding atom is same rule, the first two components of 푇푃 alone may produce the same. Then, we have ∨-nodes, and two descendants, an- many useless clauses because they fail to track dependencies, notated with 0 and 1. The left (resp. right) subtree represents such as between 푋 and 푌 on the left of Figure 19. in both cases a use of the first (resp. second) clause. Mixing Developing and Certifying Datalog Optimizations in Coq/MathComp CPP ’21, January 18–19, 2021, Virtual, Denmark the values extracted from the left branch of one of the trees all contained a subtree that fully matched the same trace. and the right branch of the other tree would amount to sim- We proved that there is no inconsistency between trees that ulate a program execution where the 푝 atom in the body of match the same trace, thus ensuring that excluding incom- the first clause is instantiated with both the first and second patible branches did not break the completeness property. clause at the same time. We can exclude this possiblity. Based on this result, we expect that there is a class of Figure 22 shows the overlapping of the two analyses us- Datalog programs, strictly larger than non-recursive ones, ing annotations. The result states that the values for 푋1 and where recursion is only used in a way that allows this more 푌1 can be extraced from the 푞 facts of the EDB, either di- precise version of our analysis. Our first intuition is that a rectly (left branch) or after permutation (right branch). More program is homogeneously recursive if, for any given rule, formally, 푋1 and 푌1 can be instantiated using the values in all argument cycles have the same length. For example, in {⟨푥,푦⟩ | 푞(푥,푦) ∈ 퐸퐷퐵 ∨ 푞(푦, 푥) ∈ 퐸퐷퐵}, which matches Figure 23, the cycles of 푋1 and 푌1 have length 2, whereas it exactly the actual behavior of the program. is 1 for 푍1. This remains to be formalized and verified.

[Figure 22 shows the two annotated trees of Example 6.1 merged into a single tree, whose leaves are ⟨푞.0, 푞.1⟩ (left branch) and ⟨푞.1, 푞.0⟩ (right branch); the diagram is not reproduced here.]

Figure 22. Merged analyses

This example shows that the annotations in the trees produced by the analysis can be used to avoid extracting many irrelevant values. Unfortunately, this operation is not safe.

    푝(푋1, 푌1, 푍1) :- 푝(푌1, 푋1, 푍1).
    푝(푋2, 푌2, 푍2) :- 푞(푋2, 푌2, 푍2).

Figure 23. Mixing some values

Example 6.2. Consider the program in Figure 23. The analysis returns the trees shown in Figure 24. If we merge the trees of 푍1 and 푋1 (resp. 푌1), we obtain that 푋1 (resp. 푌1) can only be instantiated using values from 푞.1 (resp. 푞.0) rather than both 푞.0 and 푞.1, meaning that the result is not complete.
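To see the loss of completeness on a concrete instance, assume the EDB contains the single fact 푞(푎, 푏, 푐) (a purely illustrative choice). The program of Figure 23 then derives:

    푝(푎, 푏, 푐)    (second clause, directly from 푞)
    푝(푏, 푎, 푐)    (first clause, swapping the first two arguments)

The first argument of 푝 thus takes the value 푎, coming from 푞.0, as well as the value 푏, coming from 푞.1; in particular the first clause does fire with 푋1 = 푎. A merged result that only allows values from 푞.1 for 푋1 misses this binding, hence the incompleteness.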

As explained in Section 5.6, a preliminary implementation of the analysis only considered non-recursive programs. In that setting, the no-recursion trace introduced in Section 5.2 was unnecessary, as we could prove that any trace directly matched a subtree of any analysis. When analyzing multiple variables in a given clause, the returned trees then all contained a subtree that fully matched the same trace. We proved that there is no inconsistency between trees that match the same trace, thus ensuring that excluding incompatible branches did not break the completeness property.

Based on this result, we expect that there is a class of Datalog programs, strictly larger than non-recursive ones, where recursion is only used in a way that allows this more precise version of our analysis. Our first intuition is that a program is homogeneously recursive if, for any given rule, all argument cycles have the same length. For example, in Figure 23, the cycles of 푋1 and 푌1 have length 2, whereas it is 1 for 푍1. This remains to be formalized and verified.
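As an illustration of this intuition (the class is not formally defined yet, so the following is only a candidate member), removing the third argument from the program of Figure 23 yields a recursive rule whose argument cycles all have length 2:

    푝(푋1, 푌1) :- 푝(푌1, 푋1).
    푝(푋2, 푌2) :- 푞(푋2, 푌2).

For this program, the derivable 푝-facts are exactly {⟨푥, 푦⟩ | 푞(푥, 푦) ∈ 퐸퐷퐵 ∨ 푞(푦, 푥) ∈ 퐸퐷퐵}, the same shape of result as in Figure 22, so merging the analyses of 푋1 and 푌1 would not, on this example, lose any completeness.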

7 Effects of the Optimizations

As explained in Section 2.4, the Network Optimized Datalog engine uses a representation called Differences of Cubes, which does not fare well with some primitive predicates. The complexity of the computation of these predicates grows exponentially in the number of (bits across) cubes, i.e. variables, in their instances. The goal of our optimizations is then to minimize both the number of variables and the sizes of cubes in any given program.

The clause specialization reduces the number of variables occurring in primitive predicates, but also specializes the head of rules that directly depend on facts from the EDB. This allows and fosters the use of the predicate specialization, which reduces the sizes of the cubes used in NoD.

An intuition of the effect of this transformation in our setting, network verification, is that it unrolls the topology and replaces predicates on the global states of all the network elements by local predicates on the state of, for example, a given switch. Then, the state of the ports or the packets received by the other switches will not be considered to compute the output of the switch.

Example 7.1. Applying the partial instance and the predicate specialization to the program of Figure 2 transforms the definition of linked(X,Y,P) into a set of specialized predicates linked_X_Y(P) for each pair of linked locations X and Y. These new predicates are then described independently of the rest of the topology.
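To fix ideas on the shape of the rewriting described in Example 7.1, here is a minimal sketch; the locations s1 and s2, the port range and the forward predicate are invented for the purpose of illustration and are not taken from the program of Figure 2. A generic rule over the topology such as

    forward(Y, P) :- forward(X, P), linked(X, Y, P).
    linked(s1, s2, P) :- P >= 1, P <= 1024.

would be rewritten, after specialization, into rules that only mention a local, single-variable predicate:

    forward(s2, P) :- forward(s1, P), linked_s1_s2(P).
    linked_s1_s2(P) :- P >= 1, P <= 1024.

The cube attached to linked_s1_s2 then only has to cover the bits of P, instead of the bits of X, Y and P, which is exactly the reduction in the number of variables and in cube sizes targeted by the optimizations.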

In conclusion, the transformations we introduce in this paper rewrite generic, reusable and (usually) clear and short specifications into a network-specific form closer to NoD programs. Doing so by hand is obviously possible, but also lengthy, complex and error-prone, meaning that our certified optimizations reconcile performance and safety.

[Figure 24 shows the three annotated analysis trees returned for the program of Figure 23, one per variable, with leaves ⟨푞, 1⟩ and ⟨푞, 0⟩ for 푋1, ⟨푞, 0⟩ and ⟨푞, 1⟩ for 푌1, and ⟨푞, 2⟩ for 푍1; the diagrams are not reproduced here.]

(a) Analysis of 푋1 (b) Analysis of 푌1 (c) Analysis of 푍1

Figure 24. Analysis of a heterogeneously recursive program

8 Related Works

Many Datalog optimizations have been developed and formalized. The seminal [10] provides a general survey about Datalog, including some program transformation methods that annotate the program to help top-down, goal-oriented evaluations. More recent papers focus on evaluation techniques [30] or user-directed and/or domain-specific optimizations [7, 32]. Zook et al. developed a static typing system for Datalog [37], but it only checks the type sanity of a program.

Soufflé [31] is a static analysis tool using Datalog as a specification language. It performs Datalog-level optimizations, such as magic sets and user-directed rule inlining. The Datalog code is translated into a C++ program via the Futamura projection, where the interpreter is the semi-naive evaluation [2]. Although a form of specialization, this transformation – not discussed or illustrated – does not seem to produce an explicit analysis of the program value flows.

The idea of predicate specialization is taken from the code of [25] (the paper did not mention it). Our proposal contains, to our knowledge, the first explicit and formally defined value-focused static analysis and associated optimization for Datalog. Although the idea was previously toyed with in [8, 31] as a debugging tool, our work is also, still to our knowledge, the first formalization of a Datalog trace semantics, not to mention a certified one.

The DataCert project aims at building a fully and deeply verified environment for data intensive management tools, the same way [21, 22] provide verified realistic C and ML compilers. Kriener et al. used Coq in [20] to prove the equivalence of different Prolog semantics. However, as far as we know, our work is the first formally proved implementation of non-trivial optimizations for a declarative and popular language, Datalog. It is also the first full-blown application of [5]. Although we had to slightly extend the formalization, our work shows that it can concretely be used to prove results on Datalog's use, giving credit to DataCert's ambition of offering a full environment for Datalog, among other aspects of data intensive applications.

9 Conclusion and Perspectives

We design a static analysis and two transformations for Datalog, and prove their adequacy properties. To do so, we introduce a Datalog trace semantics with a certified 푇푃 operator. Citing [23], [5] comments on the relevance of MathComp to model Datalog, especially in the use of finTypes. Although the heavy use of dependent types to fit into the finType framework was especially cumbersome when manually adding and computing rules in existing programs, the ability to use paper-like set notations was most helpful. Compared to more traditional Coq definitions, they are clearer, easier to certify and to extract or translate into executable code.

Although this work was undertaken with the setting of network verification in mind, the optimizations are application-agnostic, and might be relevant for other areas. In general, we believe that our work was a good match for a first realistic experiment in formally proving Datalog optimizations, and that the tools and experience we developed now allow us or others to consider new, potentially more ambitious optimizations. In that spirit, we already laid out ideas for a more precise version of our static analysis.

On a more theoretical side, these ideas raise questions on a tighter characterization of recursion, which we also hope to address in the near future.

Acknowledgements

We would like to thank the anonymous reviewers for their numerous and helpful comments, which resulted in a dramatically enriched paper. We also thank Théo Winterhalter and Arthur Azevedo De Amorim for helping us get started with some aspects of the Coq formalization.

References
[1] Coq development for this paper, https://orange-opensource.github.io/octant-proof
[2] Abiteboul, S., Hull, R., Vianu, V. (eds.): Foundations of Databases: The Logical Level. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edn. (1995)
[3] Aref, M., ten Cate, B., Green, T.J., Kimelfeld, B., Olteanu, D., Pasalic, E., Veldhuizen, T.L., Washburn, G.: Design and implementation of the logicblox system. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. pp. 1371–1382. ACM (2015)
[4] Bégay, P.L., Crégut, P., Monin, J.F.: Developing sequence and tree fintypes in MathComp. Coq Workshop 2020 (Jul 2020), https://coq-workshop.gitlab.io/2020/abstracts/Coq2020_03-03-seq-fintype.pdf
[5] Benzaken, V., Contejean, É., Dumbrava, S.: Certifying Standard and Stratified Datalog Inference Engines in SSReflect. In: International Conference on Interactive Theorem Proving. Brasilia, Brazil (2017), https://hal.archives-ouvertes.fr/hal-01745566
[6] Benzaken, V., Contejean, É., Dumbrava, S.: Datalogcert. https://framagit.org/formaldata/datalogcert/ (2017)
[7] Bravenboer, M., Smaragdakis, Y.: Strictly declarative specification of sophisticated points-to analyses. In: Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications. pp. 243–262 (2009)
[8] Caballero, R., García-Ruiz, Y., Sáenz-Pérez, F.: A new proposal for debugging datalog programs. Electronic Notes in Theoretical Computer Science 216, 79–92 (2008)
[9] Calì, A., Gottlob, G., Lukasiewicz, T.: Datalog±: a unified approach to ontologies and integrity constraints. In: Proceedings of the 12th International Conference on Database Theory. pp. 14–30. ACM (2009)
[10] Ceri, S., Gottlob, G., Tanca, L.: What you always wanted to know about datalog (and never dared to ask). IEEE Transactions on Knowledge and Data Engineering 1(1), 146–166 (1989)
[11] Chin, B., von Dincklage, D., Ercegovac, V., Hawkins, P., Miller, M.S., Och, F., Olston, C., Pereira, F.: Yedalog: Exploring knowledge at scale. In: 1st Summit on Advances in Programming Languages (SNAPL 2015). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2015)
[12] Cousot, P.: Constructive design of a hierarchy of semantics of a transition system by abstract interpretation. Theoretical Computer Science 277(1-2), 47–103 (2002)
[13] DeTreville, J.: Binder, a logic-based security language. In: Proceedings 2002 IEEE Symposium on Security and Privacy. pp. 105–113. IEEE (2002)
[14] Dumbrava, S.G.: Formalisation en Coq de Bases de Données Relationnelles et Déductives et Mécanisation de Datalog. Ph.D. thesis, Université Paris-Sud (2016), http://www.theses.fr/2016SACLS525
[15] Gottlob, G., Koch, C., Baumgartner, R., Herzog, M., Flesca, S.: The lixto data extraction project: back and forth between theory and practice. In: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. pp. 1–12. ACM (2004)
[16] Grumbach, S., Wang, F.: Netlog, a rule-based language for distributed programming. In: International Symposium on Practical Aspects of Declarative Languages. pp. 88–103. Springer (2010)
[17] Hoder, K., Bjørner, N., de Moura, L.: μZ – an efficient engine for fixed points with constraints. In: International Conference on Computer Aided Verification. pp. 457–462. Springer (2011)
[18] Huang, S.S., Green, T.J., Loo, B.T.: Datalog and emerging applications: an interactive tutorial. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. pp. 1213–1216. ACM (2011)
[19] Jeannet, B., Serwe, W.: Abstracting call-stacks for interprocedural verification of imperative programs. In: International Conference on Algebraic Methodology and Software Technology. pp. 258–273. Springer (2004)
[20] Kriener, J., King, A., Blazy, S.: Proofs you can believe in: proving equivalences between Prolog semantics in Coq. pp. 37–48 (Sep 2013). https://doi.org/10.1145/2505879.2505886
[21] Kumar, R., Myreen, M.O., Norrish, M., Owens, S.: CakeML: a verified implementation of ML. In: ACM SIGPLAN Notices. vol. 49, pp. 179–191. ACM (2014)
[22] Leroy, X.: Formal verification of a realistic compiler. Communications of the ACM 52(7), 107–115 (2009), http://xavierleroy.org/publi/compcert-CACM.pdf
[23] Libkin, L.: The finite model theory toolbox of a database theoretician. In: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. pp. 65–76. ACM (2009)
[24] Lopes, N.P.: Network verification website (benchmarks and code). http://web.ist.utl.pt/nuno.lopes/netverif/
[25] Lopes, N.P., Bjørner, N., Godefroid, P., Jayaraman, K., Varghese, G.: Checking beliefs in dynamic networks. In: 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). pp. 499–512. USENIX Association, Oakland, CA (2015), https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/lopes
[26] Lopes, N.P., Bjørner, N., Godefroid, P., Varghese, G.: Network verification in the light of program verification. Tech. rep., Microsoft (September 2013), https://www.microsoft.com/en-us/research/publication/network-verification-in-the-light-of-program-verification/
[27] Lu, L., Cleary, J.G.: An operational semantics of starlog. In: International Conference on Principles and Practice of Declarative Programming. pp. 294–310. Springer (1999)
[28] de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Tools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29–April 6, 2008. Proceedings. pp. 337–340 (2008). https://doi.org/10.1007/978-3-540-78800-3_24
[29] Ramakrishnan, R., Ullman, J.D.: A survey of deductive database systems. The Journal of Logic Programming 23(2), 125–149 (1995)
[30] Ryzhyk, L., Budiu, M.: Differential datalog. In: Datalog (2019)
[31] Scholz, B., Jordan, H., Subotić, P., Westmann, T.: On fast large-scale program analysis in datalog. In: Proceedings of the 25th International Conference on Compiler Construction. pp. 196–206. CC 2016, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2892208.2892226
[32] Shen, W., Doan, A., Naughton, J.F., Ramakrishnan, R.: Declarative information extraction using datalog with embedded extraction predicates. In: Proceedings of the 33rd international conference on Very Large Data Bases. pp. 1033–1044. VLDB Endowment (2007)
[33] Sozeau, M.: Equations: A dependent pattern-matching compiler. In: International Conference on Interactive Theorem Proving. pp. 419–434. Springer (2010)
[34] Tarski, A.: A lattice-theoretical fixpoint theorem and its applications. Pacific Journal of Mathematics 5(2), 285–309 (1955)
[35] Tristan, J.B., Leroy, X.: Formal verification of translation validators: a case study on instruction scheduling optimizations. In: Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages. pp. 17–27 (2008)
[36] Van Emden, M.H., Kowalski, R.A.: The semantics of predicate logic as a programming language. Journal of the ACM (JACM) 23(4), 733–742 (1976)
[37] Zook, D., Pasalic, E., Sarna-Starosta, B.: Typed datalog. In: International Symposium on Practical Aspects of Declarative Languages. pp. 168–182. Springer (2009)