Empirical Study on the Usage of Graph Query Languages in Open Source Java Projects
Total Page:16
File Type:pdf, Size:1020Kb
Empirical Study on the Usage of Graph Query Languages in Open Source Java Projects Philipp Seifer Johannes Härtel Martin Leinberger University of Koblenz-Landau University of Koblenz-Landau University of Koblenz-Landau Software Languages Team Software Languages Team Institute WeST Koblenz, Germany Koblenz, Germany Koblenz, Germany [email protected] [email protected] [email protected] Ralf Lämmel Steffen Staab University of Koblenz-Landau University of Koblenz-Landau Software Languages Team Koblenz, Germany Koblenz, Germany University of Southampton [email protected] Southampton, United Kingdom [email protected] Abstract including project and domain specific ones. Common applica- Graph data models are interesting in various domains, in tion domains are management systems and data visualization part because of the intuitiveness and flexibility they offer tools. compared to relational models. Specialized query languages, CCS Concepts • General and reference → Empirical such as Cypher for property graphs or SPARQL for RDF, studies; • Information systems → Query languages; • facilitate their use. In this paper, we present an empirical Software and its engineering → Software libraries and study on the usage of graph-based query languages in open- repositories. source Java projects on GitHub. We investigate the usage of SPARQL, Cypher, Gremlin and GraphQL in terms of popular- Keywords Empirical Study, GitHub, Graphs, Query Lan- ity and their development over time. We select repositories guages, SPARQL, Cypher, Gremlin, GraphQL based on dependencies related to these technologies and ACM Reference Format: employ various popularity and source-code based filters and Philipp Seifer, Johannes Härtel, Martin Leinberger, Ralf Lämmel, ranking features for a targeted selection of projects. For the and Steffen Staab. 2019. Empirical Study on the Usage ofGraph concrete languages SPARQL and Cypher, we analyze the Query Languages in Open Source Java Projects. In Proceedings activity of repositories over time. For SPARQL, we investi- of the 12th ACM SIGPLAN International Conference on Software gate common application domains, query use and existence Language Engineering (SLE ’19), October 20–22, 2019, Athens, Greece. of ontological data modeling in applications that query for ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3357766. concrete instance data. Our results show, that the usage of 3359541 graph query languages in open-source projects increased over the last years, with SPARQL and Cypher being by far 1 Introduction the most popular. SPARQL projects are more active in terms Graph data models are interesting in various domains, in of query related artifact changes and unique developers in- part due to the flexibility and intuitiveness they offer. Unlike volved, but Cypher is catching up. Relatively few applications the relational data model, schemata may be incrementally use SPARQL to query for concrete instance data: A majority developed and adapted, or even omitted entirely. The graph of those applications employ multiple different ontologies, model is also intuitive for modeling tasks, for example in general knowledge as exemplified by the Wikidata Knowl- Permission to make digital or hard copies of all or part of this work for edge Graph [70]. Specialized query languages facilitate the personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear use of the graph data model. A long-standing standard is this notice and the full citation on the first page. Copyrights for components SPARQL [57], a query language for RDF graphs which con- of this work owned by others than ACM must be honored. Abstracting with ceptually matches on the triple structure of RDF. For prop- credit is permitted. To copy otherwise, or republish, to post on servers or to erty graphs, languages like Cypher [27] emerged, featuring redistribute to lists, requires prior specific permission and/or a fee. Request specialized constructs to build results by selecting nodes permissions from [email protected]. and edges, either of which may potentially hold properties. SLE ’19, October 20–22, 2019, Athens, Greece © 2019 Association for Computing Machinery. While the major graph-related query approaches have been ACM ISBN 978-1-4503-6981-7/19/10...$15.00 compared in terms of various properties of both language [4] https://doi.org/10.1145/3357766.3359541 and implementation [2, 34], we are interested in the practical SLE ’19, October 20–22, 2019, Athens, Greece Philipp Seifer, Johannes Härtel, Martin Leinberger, Ralf Lämmel, and Steffen Staab usage of such technologies. The focus of this study, therefore, both cases we do not expect representative results in open- is the comparison of overall popularity and practical use of source projects. Our study therefore focuses on the following graph-related query languages in real-world projects. More four graph-related query languages. Examples are given in specifically, we analyze open-source Java projects on GitHub. Figure1, in the context of a simple graph consisting of person Our primary contributions follow. nodes and knows edges. 1. We compare the overall popularity of graph-related query languages in open-source projects, in terms of SELECT ?p ?k WHERE { numbers of projects using the query languages, num- ?p a foaf:Person ; foaf:knows ?k . bers of affected methods and files, numbers of devel- ?k a foaf:Person opers involved and the changes of these numbers over time. } 2. We use a refined, usage criteria-based approach to ana- lyze usage of query languages in open-source projects, MATCH (p: Person) − [ :KNOWS] − >(k:Person) in terms of distinction of framework-like functionality RETURN p , k versus concrete query usage and in terms of determin- ing usage of ontologies. The remainder of this paper is structured as follows. Section2 g . V ( ) motivates the selection of query languages we analyze and . h a s L a b e l ( ' person ' ) . as ( ' p ' ) gives a short overview of these technologies. In Section3, . outE ( ' knows ' ) . inV ( ) we introduce our research questions in depth and design . h a s L a b e l ( ' person ' ) . as ( ' k ' ) the methodology of our study, to then present our results . s e l e c t ( ' p ' , ' k ' ) in Section4. We conclude with related work in Section5, before our summary and future work in Section6. person { name 2 Graph Query Languages knows { name } We restrict our study to popular graph-related query lan- } guages. Our selection is primarily based on two factors. First, we consider languages popular enough to occur in Figure 1. Example queries selecting people and the people the GitHub Linguist classification [29]. For all 461 candidate they know in SPARQL (1st), Cypher (2nd), Gremlin (3rd) and 1 languages we manually identify query languages by con- GraphQL (4th). sidering resources such as project websites, GitHub reposi- tories, or otherwise related Wikipedia entries. We classify the following 11 languages as query related: Flux, GraphQL, SPARQL is a W3C standardized query language for RDF HiveQL, JsonIQ, Lasso, PLSQL, SPARQL, SQL, SQLPL, TSQL graphs, which at its core matches the triple structure (subject, and XQuery. Here SPARQL and GraphQL [21, 30] are pri- predicate and object) of RDF data. SPARQL is strongly asso- marily related to graphs and therefore included in our com- ciated with the semantic web and commonly used conjointly parison. As a baseline we also refer to XQuery [59] (an XML with ontologies. database query language) and SQL. We suspect not all interesting query languages to be listed Cypher is a declarative query language for property graphs, in Linguist, especially since not all languages must pro- initially developed for the Neo4j platform. Since 2015, the duce dedicated artifacts. Therefore, we use our identified language is standardized by the openCypher [53] project. languages as seed languages to find related work on graph- Cypher queries rely on matching graph patterns expressed related query languages. Such work (e.g., [4, 34]) frequently with ASCII-Art syntax. refers to both Cypher [27, 52], the property graph query Gremlin is a graph traversal language with a focus on language initially developed for Neo4j [51], and the Gremlin path-like expressions. It is the core graph matching language graph traversal language [25, 60]. We do not include lan- of the Apache Tinkerpop [24] framework. guages exclusive to proprietary graph databases, or more recent developments in research, such as G-CORE [3]. In GraphQL Unlike the other three query languages, the pri- mary focus of GraphQL is not purely database access. Instead, 1 For the 519 languages that occur in this list, we consider all languages GraphQL was developed as a graph-based replacement for classified as type: data (95, e.g., SQL) or type: programming (366, e.g., C), while excluding those with type classifications of markup (45, e.g., HTML) the REST paradigm. In GraphQL, queries consist of JSON-like and prose (13, e.g., Markdown). We include programming languages since templates that are matched on server side schema definitions, some languages such as XQuery are not classified as data languages. usually in the context of web APIs. Empirical Study on the Usage of Graph Query Languages in Open Source Java Projects SLE ’19, October 20–22, 2019, Athens, Greece 3 Methodology Table 1. Number of dependencies per query language, num- Our empirical study targets open-source projects, with a ber of distinct groups (i.e., project names) and average num- ∼ scope restricted to Java projects and the GitHub platform. ber of dependencies per group (where marks estimations). This section defines the two research questions we answer, as well as the design of our study in terms of data extraction Language Dependencies Groups Per Group and analysis. With our first research question we aim at SPARQL 191 72 2.65 understanding popularity and usage of the selected graph- Cypher 80 14 5.71 based query languages.