Diagrammatic Queries and Graph Databases (Extended Abstract)

Greg Butler, Ling Chen, Xuede Chen, Lugang Xu
Department of Computer Science, Concordia University, 1455 de Maisonneuve Blvd West, Montréal, Québec, Canada H3G 1M8

Abstract. Diagrams are an intuitive way for scientists to pose queries to relational, object-relational, and object databases. They allow the full range of queries, from the very simple to the very complex, to be much more easily expressed and understood than SQL-like languages or form-based queries. Diagrammatic queries are particularly appropriate for interactions as found in databases for metabolic pathways, protein-protein interactions, and gene regulation. We describe a system using graph databases for diagrammatic queries.

1 Introduction

Data management, access, and mining are at the heart of bioinformatics. While relational databases are the accepted standard within industry, there has been considerable research into deductive databases and graph databases to extend the capabilities of relational databases. Deductive databases allow a view, called the IDB (intensional database), to be defined using logical rules, and allow logical queries against the view. Since the rules allow recursive definitions, the resulting expressive power of the query language is greater than that of ordinary relational databases. Graph query languages are even more expressive, while having the very important property of a visual representation.

The research group of Alberto Mendelzon at the University of Toronto developed the GraphLog graph query language based on hygraphs, and a visual interface, called HY+, for expressing queries and browsing their results [2]. They demonstrated applications of HY+ to querying a software repository about the structure of the code, and to the management of telecommunication networks. More recently, Tony Bonner, also at the University of Toronto, has been using HY+ to assemble DNA sequences. Since Mendelzon’s work, which took place from about 1985 to 1994, graph databases and so-called path queries have been intensively studied because of their application to querying semistructured data on the web.

HY+ has demonstrated the ease of use of diagrammatic queries. The sample applications demonstrated that GraphLog can be efficiently implemented: the software repository example contained millions of tuples. HY+ implements GraphLog by translating the graph query into a query and view for the deductive database CORAL [8].

Workshop on Managing and Integrating Biochemical Data 2000

Our desire is to apply the benefits of graph databases, diagrammatic queries, and visualization of results more broadly in genomics. However, we discovered that HY+ is written in Smalltalk using an outdated version of VisualWorks that is not compatible with the current offerings from Cincom. So, we are porting HY+ to Java using the Swing library. The initial version of the interface will use CORAL as its database engine. Eventually, it will interface directly to our KNOW-IT-ALL framework for databases. In this paper, we present HY+ and GraphLog, and discuss our ongoing work to port HY+ from Smalltalk to Java, and to incorporate diagrammatic queries and graph databases into our KNOW-IT-ALL framework for scientific databases.

1.1 Motivation

Most data access today in genomics is provided by point-and-click interfaces on icons for canned queries, or by filling in forms for parameterized sets of canned queries, or by SQL-like textual notation for more advanced or flexible querying [7]. These are solutions tailored to the underlying database technology, rather than solutions tailored to the scientists [3]. There are three central problems when providing access to genomics data for scientists.

Problem 1: Expressing queries in a way that is intuitive to scientists. HY+ and GraphLog allow scientists to construct diagrams (graphs) to express the query. The diagram depicts the entities of interest and their relationships. Icons can be designed to depict the different types of entities, and the lines representing relations can be colour-coded to ease visual comprehension of the query by the scientists.

Problem 2: Expressing complex queries. GraphLog is more expressive than relational or deductive databases. A complex query can be decomposed by defining intermediate views of concepts, properties, and relations. Blobs can represent collections of entities, and path expressions enable complex subgraphs (paths) to be shortened to one arrow with an expression for its label. The path expressions allow recursion, typically as transitive closure.

Problem 3: Exploring large, complex results of queries. The result of a query is a diagram with the same icons and colour-coding as in the query. The layout of the diagram can be controlled to enhance comprehension, filters can be applied to hide parts of the result, and the scientist can zoom in to see more detail.

Diagrams are an intuitive way for scientists to pose queries to relational, object-relational, and object databases. They allow the full range of queries, from the very simple to the very complex, to be much more easily expressed and understood than SQL-like languages or form-based queries. Diagrammatic queries are particularly appropriate for interactions as found in databases for metabolic pathways, protein-protein interactions, and gene regulation.

G. Butler et al / Diagrammatic Queries and Graph Databases

2 HY+ and GraphLog

GraphLog is a graph query language extending Datalog with negation. The language has recursion, usually as transitive closure, and has path expressions. Path expressions are similar to regular expressions. A path expression can refer to a primitive relation, or construct more complex path expressions using the operators of negation/complement, inverse, concatenation, alternation, Kleene closure (*), or transitive closure (+). GraphLog is more expressive than Datalog (and SQL).

HY+ is a visual Smalltalk environment for composing GraphLog queries and viewing the results of the query. In the diagrams, nodes represent entities, while labels add identifiers and attributes to nodes. Edges represent relations and edge labels are path expressions. Blobs are nodes containing other nodes and represent a relation on containing and contained nodes. The underlying data structure of GraphLog and HY+ is a hygraph.
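The path-expression operators listed above can be illustrated on binary relations stored as sets of (source, target) pairs. The following Python sketch is only an illustration of the operators' set semantics, not the GraphLog or HY+ implementation; the relation `regulates`, the node set `genes`, and all function names are hypothetical.

```python
from itertools import product

def inverse(r):
    """Inverse: reverse the direction of every edge."""
    return {(b, a) for (a, b) in r}

def concat(r, s):
    """Concatenation: follow an r edge, then an s edge."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}

def alternation(r, s):
    """Alternation: an edge of either relation."""
    return r | s

def transitive_closure(r):
    """Transitive closure (+): paths of one or more r edges."""
    closure = set(r)
    while True:
        new = concat(closure, r) - closure
        if not new:
            return closure
        closure |= new

def kleene_closure(r, nodes):
    """Kleene closure (*): paths of zero or more r edges."""
    return transitive_closure(r) | {(n, n) for n in nodes}

def complement(r, nodes):
    """Complement: all node pairs not related by r."""
    return set(product(nodes, repeat=2)) - r

# Hypothetical toy data: a small regulation chain.
genes = {"geneA", "geneB", "geneC"}
regulates = {("geneA", "geneB"), ("geneB", "geneC")}
indirectly_regulates = transitive_closure(regulates)
```

Note that the complement and the Kleene closure need an explicit node universe, which is why they take `nodes` as a parameter.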

2.1 Hygraphs

A hygraph is a hybrid between Harel’s higraphs and Berge’s hypergraphs. For completeness, here is a formal definition.

Definition. A hygraph is a septuple $(N, L_N, \nu, L_E, E, L_B, B)$, where

$N$ is a finite set of nodes,
$L_N$ is a set of node labels,
$\nu : N \to L_N$ is the node labelling function,
$L_E$ is a set of edge labels,
$E \subseteq N \times N \times L_E$ is a finite set of labelled edges,
$L_B$ is a set of blob labels, and
$B \subseteq N \times 2^N \times L_B$ is a finite set of labelled blobs,

with the restriction that $B$ is a function $N \times L_B \to 2^N$.
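As an illustration of the definition, a hygraph can be held in a small data structure with nodes, a node labelling, labelled edges, and labelled blobs. The sketch below is an assumption-laden Python rendering, not the HY+ data structure: the class and method names are hypothetical, and the restriction on blobs is read as "a containing node together with a blob label determines one set of contained nodes", which is enforced by keying blobs on that pair.

```python
from dataclasses import dataclass, field

@dataclass
class Hygraph:
    nodes: set = field(default_factory=set)         # finite set of nodes
    node_label: dict = field(default_factory=dict)  # node labelling function
    edges: set = field(default_factory=set)         # (source, target, edge label)
    blobs: dict = field(default_factory=dict)       # (container, blob label) -> contained nodes

    def add_node(self, node, label):
        self.nodes.add(node)
        self.node_label[node] = label

    def add_edge(self, src, dst, label):
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError("edge endpoints must be nodes of the hygraph")
        self.edges.add((src, dst, label))

    def add_blob(self, container, contained, label):
        members = frozenset(contained)
        if container not in self.nodes or not members <= self.nodes:
            raise ValueError("blob nodes must be nodes of the hygraph")
        key = (container, label)
        # Assumed reading of the restriction: one member set per (container, label).
        if key in self.blobs and self.blobs[key] != members:
            raise ValueError("a (container, label) pair may map to only one node set")
        self.blobs[key] = members

# Hypothetical example: a reaction node with a blob of its reactants.
h = Hygraph()
h.add_node("X", "Reaction(X)")
h.add_node("C1", "Compound(C1)")
h.add_node("ATP", "Compound('ATP')")
h.add_edge("X", "C1", "reactant")
h.add_blob("X", {"C1", "ATP"}, "reactants_2")
```

Storing blobs in a dictionary rather than as a set of triples makes the functional restriction hold by construction.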

2.2 Diagrammatic Queries

The query interface for HY+ imports the database schema, together with iconic descriptions for each class of entities and relations. The icons for relations are various forms of lines and arrows. A query, or a view definition, is constructed by selecting icons and composing a hygraph. A hygraph is basically a graph augmented with blobs. A blob relates a containing node with a set of contained nodes, and can be viewed as providing an aggregation or subset relation useful for abstraction of complex graphs.

The interface provides a window for the definition of views. A view may define a property of an entity, as a relation between the entity class and a “ground” symbol. A view may define a relation in terms of pre-existing concepts in the database and the views. A view may also define a blob in terms of pre-existing concepts in the database and the views. The icons for the defined property and relation arrows, and for blobs, appear in the schema window along with the icons for the database entities and relations. Closure of a relation is depicted as a dashed arrow of the same colour as the underlying relation. A complement (negation) of a relation has a cross through the arrow.

The interface provides a window for the definition of queries. While composing a query, it is possible to have open the windows containing the definitions of the views that are used by the query. This, together with the schema window, guides the user to make well-formed queries.

[Figure 1 shows four HY+ windows. Two defineGL windows define the blobs reactants_2 and products_2 on Reaction(X) and Compound(C), using count(C) = 2 over the reactant and product relations. A third defineGL window defines the relation phosphorylate between Compound(C1) and Compound(C2), using Reaction(X), reactants_2, products_2, 'ATP', and 'ADP'. A showGL window poses the query phosphorylate on Protein(P).]

Figure 1: An Example Query

Figure 1 shows the formulation of a query in HY+. The query is taken from Karp [6, p. 272, query 7], which asks: Find all proteins that autophosphorylate. The underlying data model has the entities reaction, compound, and protein, and the relations reactant and product from reaction to compound. A protein is a special case of a chemical compound. The query formulation in the paper defines phosphorylation as a reaction where there are exactly two reactants and exactly two products, the reactants are a protein and ATP, and ADP is one of the products. For the query in HY+, we define three intermediate relations: reactants_2, products_2, and phosphorylate in the top three windows of Figure 1. The blob reactants_2 is defined to be the set of compounds C that are reactants for a reaction X, with the additional constraint that there are precisely two reactants C. The blob products_2 is similarly defined. The relation phosphorylate is defined to hold between two compounds C1 and C2 that satisfy Karp’s definition. The bottom window of Figure 1 asks for proteins P where phosphorylate(P, P).

A defineGraphLog window has a hygraph defining a view. It may define a new property (which is a relation to the “ground” entity), a new relation, or a new blob. A showGraphLog window has a hygraph expressing a query. The final database query is the union of the queries in all showGraphLog windows. A bold line indicates the focus of a definition or query.

2.3 Translation of Queries

The translation of GraphLog diagrams into logic programming notations such as CORAL or Datalog is straightforward. A node, such as Reaction(X), is translated into a fact Reaction(X) with variable X. The node label could contain specifications of constraints of attributes of a reaction, and these would be included as attributes of the fact. An edge, such as the one with label reactant above, is translated into the relation reactant(Reaction(X),Compound(C)).

The example below shows parts of the translation of Figure 1. Each definition or query translates to a set of rules for the focus of the window, as shown by the definition of phosphorylate. The overall query is represented as the alternation of the queries in each showGraphLog window: in our example we have only one.

    final_query(P) :- query_1(P).
    query_1(P) :- phosphorylate(Protein(P),Protein(P)).
    phosphorylate(C1,C2) :-
        reactants_2(Reaction(X),Compound(C1)),
        reactants_2(Reaction(X),Compound('ATP')),
        products_2(Reaction(X),Compound('ADP')),
        products_2(Reaction(X),Compound(C2)).
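To make the meaning of the translated rules concrete, the sketch below evaluates them directly over a handful of facts in Python. It is a toy evaluation under stated assumptions, not how CORAL processes the program: the reaction and compound names are invented, the helper `exactly_two` is hypothetical, and the count constraint is read as "the reaction has exactly two distinct compounds in that role".

```python
from collections import defaultdict

# Hypothetical toy facts: (reaction, compound) pairs.
reactant = {("r1", "kinaseA"), ("r1", "ATP"),
            ("r2", "kinaseB"), ("r2", "ATP"),
            ("r3", "glucose"), ("r3", "ATP"), ("r3", "Mg")}
product = {("r1", "kinaseA"), ("r1", "ADP"),
           ("r2", "substrateP"), ("r2", "ADP"),
           ("r3", "G6P"), ("r3", "ADP")}
protein = {"kinaseA", "kinaseB", "substrateP"}

def exactly_two(relation):
    """Keep the pairs of reactions that have exactly two compounds in this role."""
    by_reaction = defaultdict(set)
    for reaction, compound in relation:
        by_reaction[reaction].add(compound)
    return {(x, c) for x, cs in by_reaction.items() if len(cs) == 2 for c in cs}

reactants_2 = exactly_two(reactant)
products_2 = exactly_two(product)

# phosphorylate(C1,C2) :- reactants_2(X,C1), reactants_2(X,'ATP'),
#                         products_2(X,'ADP'), products_2(X,C2).
phosphorylate = {
    (c1, c2)
    for (x, c1) in reactants_2
    for (y, c2) in products_2
    if x == y and (x, "ATP") in reactants_2 and (x, "ADP") in products_2
}

# query_1(P) :- phosphorylate(Protein(P),Protein(P)).
query_1 = {p for p in protein if (p, p) in phosphorylate}

# final_query is the union over all showGraphLog windows; there is only one here.
final_query = set().union(query_1)
```

In this toy data only `kinaseA` appears as both the protein reactant and a product of a qualifying reaction, so it is the only answer; reaction `r3` is excluded because it has three reactants.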

Path expressions are easily expanded into logic programs. For example, the transitive closure (+) of the phosphorylate relation could be expressed as

    tc_phosphorylate(X,Y) :- phosphorylate(X,Y).
    tc_phosphorylate(X,Y) :- phosphorylate(X,T), tc_phosphorylate(T,Y).
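One common way to evaluate such a recursive pair of rules bottom-up is semi-naive iteration, in which each round joins only the newly derived tuples against the base relation. The Python sketch below applies that technique to the two rules above; it is an illustration with hypothetical names and data, and makes no claim about the strategy CORAL itself uses.

```python
from collections import defaultdict

def tc_phosphorylate(phosphorylate):
    """Semi-naive evaluation of the two tc_phosphorylate rules."""
    # Index for the join on T in: phosphorylate(X,T), tc_phosphorylate(T,Y).
    sources_of = defaultdict(set)  # T -> {X : phosphorylate(X,T)}
    for x, t in phosphorylate:
        sources_of[t].add(x)

    closure = set(phosphorylate)   # first rule: every base edge is in the closure
    delta = set(phosphorylate)     # tuples derived in the previous round
    while delta:
        # Second rule, applied only to the tuples that are new.
        derived = {(x, y) for (t, y) in delta for x in sources_of[t]}
        delta = derived - closure
        closure |= delta
    return closure

# Hypothetical kinase cascade.
cascade = {("k1", "k2"), ("k2", "k3"), ("k3", "k4")}
reachable = tc_phosphorylate(cascade)
```

The loop terminates on cyclic data as well, because a round that derives no new tuples leaves `delta` empty.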

The hard part is ensuring efficient processing of the translated query. There has been no experimentation with CORAL optimization strategies, and one really needs multidimensional indexes of the underlying relations to support efficient processing of the deductive programs.

2.4 Presentation of Query Results

Hygraphs are also the means for presentation of results of queries. The presentation is controlled initially by specifying a layout algorithm for the hygraph in a layoutGraphLog window, and by specifying which entities or relations to hide in a hideGraphLog window. As before, the use of icons and colours aids comprehension of the hygraphs. HY+ also provides the ability to zoom in for more detail. The window for the presentation of results displays the hygraphs of the results, and provides tools for abstraction by collapsing blobs.

It is very natural in HY+ to refine queries by simply regarding the current query as a view definition, and then composing a query (a refinement) against the database and using the new view. Of course, there is the potential to use the set of results as a materialized view.
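The two presentation controls mentioned above, hiding relations and collapsing blobs, can be sketched as operations on a set of labelled result edges. The Python below is a rough illustration only: the function names and data are hypothetical, and the semantics chosen for collapsing (redirect every edge that touches a contained node to the containing node, then drop self-loops) is an assumption rather than the documented HY+ behaviour.

```python
def hide(edges, hidden_labels):
    """Drop every edge whose relation label is marked as hidden."""
    return {(s, d, label) for (s, d, label) in edges if label not in hidden_labels}

def collapse_blob(edges, container, members):
    """Replace the contained nodes of a blob by its containing node."""
    def representative(node):
        return container if node in members else node
    redirected = {(representative(s), representative(d), label)
                  for (s, d, label) in edges}
    # Edges that now start and end at the container carry no information.
    return {(s, d, label) for (s, d, label) in redirected if s != d}

# Hypothetical result: reactions r1 and r2 belong to a blob named "glycolysis".
result_edges = {("r1", "G6P", "product"), ("r2", "G6P", "reactant"),
                ("r2", "F6P", "product"), ("r3", "F6P", "reactant"),
                ("r1", "r2", "precedes")}
collapsed = collapse_blob(result_edges, "glycolysis", {"r1", "r2"})
reactants_only = hide(collapsed, {"product"})
```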

[Figure 2 shows the Know-It-All Framework as a Generic Infrastructure together with a DataModel, where each Specific DataModel comprises a Query Language, Query Optimization, Indexing Techniques, and Physical Storage.]

Figure 2: Overview of the KNOW-IT-ALL Framework

3 The KNOW-IT-ALL Framework

KNOW-IT-ALL is an object-oriented framework for database management systems. It is written in C++, with some Java for user interfaces, and XML for communication of data between the C++ framework and the Java tools. The user interfaces will provide a full range of query mechanisms, from icons for canned queries, to forms, to textual queries in set comprehension languages, and diagrams.

KNOW-IT-ALL is designed with scientific databases in mind, and does not provide for transactions. Instead, it provides a data feed mechanism for bulk or incremental data loads. The prime concern is querying of existing data. The framework provides a generic infrastructure for database management systems and allows them to support a range of data models (relational, object, object-relational, etc.) where the data model itself, and its constituents for query language, query optimizing, indexing, and storage have clearly defined roles (see Figure 2).

[Figure 3 shows an Application/GUI on top of a Database, which is a stack of four layers: ViewDB Layer, ConceptualDB Layer, LogicalDB Layer, and PhysicalDB Layer.]

Figure 3: Layer View of a Database

A database in KNOW-IT-ALL is seen as a series of layers, each of which provides the same interface. The usual breakdown of responsibilities into physical, logical, conceptual, and view layers is followed by KNOW-IT-ALL, as shown in Figure 3. However, a database, as seen by the end-user, allows views of views, and mappings of object conceptual models to relational conceptual models. Eventually, KNOW-IT-ALL will incorporate composite databases (such as integrated or heterogeneous databases) and make no distinction between simple and composite databases.

[Figure 4 shows a Layer placed between two neighbouring layers, each marked Next Layer. The diagram labels are: Language, Schema, Translate, ReConstruct, produce, Query, and Results.]

Figure 4: Basic Building Block

Each layer in KNOW-IT-ALL is basically a translator between its client layer and its supplier layer(s), as shown in Figure 4. A layer provides a mechanism to decompose or translate queries, and a mechanism to reconstruct answers (for example, an execution plan for relational algebra expressions). The translation is done with the aid of the schema, and produces both the translated query and the mechanism to reconstruct answers.

The KNOW-IT-ALL framework contains a subframework for query optimization, and a subframework for indexing techniques. The optimization framework is based on the broadly applicable OPT++ [5]. The indexing subframework is based on GiST [4], which covers tree-based indexes, including multi-dimensional trees and similarity-based retrieval. The indexing subframework will have to be extended to cover inverted files and hashing techniques.

Our prototype implements the relational data model. For support of diagrammatic queries, it is a priority to support deductive and graph databases. For general needs in genomics, we need to support object databases. This will allow us to support spatial, temporal, and image databases. There are now also algebras for object-relational databases, so they will also be supported.
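The translator role of a layer described above, where translation uses the schema and yields both a translated query and a mechanism to reconstruct answers, can be sketched as follows. This Python fragment is a deliberately small illustration and not the KNOW-IT-ALL API (the framework itself is written in C++): the class names, the `translate` and `execute` methods, and the toy "projection by column name" query are all assumptions.

```python
class PhysicalLayer:
    """Bottom layer: stores rows as tuples and projects by column position."""
    def __init__(self, rows):
        self.rows = rows

    def execute(self, positions):
        return [tuple(row[i] for i in positions) for row in self.rows]

class LogicalLayer:
    """Translates a projection by column name into one by position."""
    def __init__(self, schema, supplier):
        self.schema = schema        # column names in storage order
        self.supplier = supplier    # the next layer down

    def translate(self, columns):
        # The schema drives the translation of the query ...
        positions = [self.schema.index(c) for c in columns]
        # ... and the same step produces the mechanism to reconstruct answers.
        def reconstruct(rows):
            return [dict(zip(columns, row)) for row in rows]
        return positions, reconstruct

    def execute(self, columns):
        translated, reconstruct = self.translate(columns)
        return reconstruct(self.supplier.execute(translated))

# Hypothetical usage with invented data.
storage = PhysicalLayer([("r1", "ATP", "reactant"), ("r1", "ADP", "product")])
db = LogicalLayer(["reaction", "compound", "role"], storage)
answers = db.execute(["compound", "role"])
```

Because both classes expose the same `execute` entry point, further layers could be stacked on top in the same way, which mirrors the statement that every layer provides the same interface.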

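[Figure 5 shows a stack of layers, each using a schema. From top to bottom: GraphLog queries enter the GraphLog View DB; Coral queries enter the Coral View DB, which optimizes Coral; Relational Algebra enters the Relational Conceptual DB; Relational Algebra enters the Relational Logical DB, which optimizes the use of indexes; and the Relational Physical DB holds multidimensional indexes.]

Figure 5: HY+ and GraphLog in the KNOW-IT-ALL Framework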

4 Conclusion and Future Work

While the initial version of the Java HY+ interface will be implemented using CORAL, we will eventually incorporate it as part of our framework. The first step is to incorporate deductive databases, in the form of CORAL, into the KNOW-IT-ALL framework for databases. This will be done by regarding the deductive database as a VIEWDB subclass defining an intensional database view of a relational database. See Figure 5. The second step is to incorporate graph databases directly into the KNOW-IT-ALL framework, also as a VIEWDB subclass of either relational, object, or object-relational databases. Then we will incorporate recent techniques for the processing of path queries.

We will explore other visualization tools, such as Java3D for 3-dimensional display, and Jazz [1] for zoomable displays, for the presentation of query results.

Genomics needs intuitive query notations. HY+ and GraphLog are proven technologies. Our work will provide

a (standalone) Java user interface for hygraphs and GraphLog; and

improved graph database technology in KNOW-IT-ALL, and hence improved performance for diagrammatic queries.

This is ongoing work to provide a database of pathways in yeast and other fungi, with particular application to the regulation of starch and sucrose metabolism in Aspergillus niger.

Acknowledgements. This work has been supported by NSERC of Canada and FCAR of Québec. Discussions with Gosta Grahne on graph databases and path queries are gratefully acknowledged. The Centre for Structural and Functional Genomics at Concordia University provides the genomics focus for our work in bioinformatics.

References

[1] Ben Bederson, Jazz: Zoomable User Interface Toolkit for Java, (http://www.cs.umd.edu/hcil/jazz)
[2] M.P. Consens, F.Ch. Eigler, M.Z. Hasan, A.O. Mendelzon, E.G. Noik, A.G. Ryman, and D. Vista, Architecture and applications of the Hy+ visualization system, IBM Systems J. 33:3 (1994), pp. 458–476.
[3] Dimitrij Frishman, Klaus Heumann, Arthur Lesk, Hans-Werner Mewes, Comprehensive, comprehensible, distributed and intelligent databases: Current status, Bioinformatics, 1998, vol. 14, No. 7, pp. 551–561.
[4] J.M. Hellerstein, J.F. Naughton, A. Pfeffer, Generalized search trees for database systems. In VLDB1995 (Proceedings of 21th International Conference on Very Large Data Bases, Sept. 11–15, 1995, Zurich), 1995, pp. 562–573.
[5] N. Kabra, D. J. DeWitt, OPT++: An object-oriented implementation for extensible database query optimization. VLDB Journal 8,1 (1999) 55–78.
[6] P.D. Karp, An ontology for biological function based on molecular interactions. Bioinformatics 16, 3 (2000) 269–285.
[7] Stanley I. Letovsky. Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, 1999.
[8] R. Ramakrishnan, D. Srivastava, S. Sudarshan, P. Seshadri, The CORAL deductive system. VLDB Journal 3,2 (1994) 161–210.