SynopSys: Large Graph Analytics in the SAP HANA Through Summarization

Michael Rudolf1 Marcus Paradies1 Christof Bornhövd2 Wolfgang Lehner3

1SAP AG 2SAP Labs, LLC 3Database Technology Group Walldorf, Germany Palo Alto, CA 94304, USA TU Dresden, Germany michael.rudolf01@.com [email protected] [email protected]

ABSTRACT “Cell Phones “Computers & & Accessories” Graph-structured data is ubiquitous and with the advent of social Accessories” 4 networking platforms has recently seen a significant increase in 6 popularity amongst researchers. However, also many business appli- part of part of “Freddy” cations deal with this kind of data and can therefore benefit greatly 10 “Mike” from graph processing functionality offered directly by the underly- 8 5 “Phones” “Tablets” 7 ing database. This paper summarizes the current state of graph data in rates 4/5 processing capabilities in the SAPHANA database and describes our efforts to enable large graph analytics in the context of our research in 11 “Steve” rates 5/5 white 3 “Apple in project SynopSys. With powerful graph pattern matching support at iPhone 4” 16 GB the core, we envision OLAP-like evaluation functionality exposed to rates 5/5 the user in the form of easy-to-apply graph summarization templates. black rates 3/5 black 64 GB 1 32 GB 2 By combining them, the user is able to produce concise summaries 9 of large graph-structured datasets. We also point out open questions “Apple iPad “Apple and challenges that we plan to tackle in the future developments on MC707LL/A” “Carl” iPhone 5” our way towards large graph analytics. Figure 1: Example data expressed in the property graph data model. Edge attributes are typeset in italics; vertex types are Categories and Subject Descriptors indicated through different colors. H.3.3 [Information Systems]: Information Storage and Retrieval— information search and retrieval

General Terms their attention to this subject and fostered the NoSQL and BigData Algorithms, Design, Performance movements. However, traditional business applications often also have to deal with graphs, for example in supply network manage- Keywords ment and traceability, transportation and logistics, as well as for the classic bill of materials. It is therefore safe to say that graph data Graph matching, graph transformation, graph summarization, SAP is ubiquitous in all kinds of business applications and has a long HANA database system history. Graphs come in many different flavors. In its most basic form a 1. INTRODUCTION graph G is made up of a set of vertices V and a relation E V V ⊆ × With the incessantly growing number of participants in social net- representing the edges between them (sometimes also noted as V(G) works and their constant production of interlinked content, the last and E(G), respectively). Depending on the use case, this definition few years have seen a massive rise in the amount and complexity is being extended or altered accordingly (e.g., for undirected graphs of graph-structured data. As new technologies were needed for or hypergraphs). In the remainder of this paper we will focus on the efficiently managing and processing this kind of information in this property graph model [10], because it is very general but neverthe- new order of magnitude, more and more researchers have turned less flexible enough to support many use cases. Also, other graph models can easily be mapped to the property graph model. A property graph is a directed vertex-labeled and edge-labeled Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are graph, meaning that both vertices and edges can have attributes each not made or distributed for profit or commercial advantage and that copies consisting of a key and a value. Figure 1 shows an example of a bear this notice and the full citation on the first page. To copy otherwise, to property graph for products, categories, users, and ratings, which republish, to post on servers or to redistribute to lists, requires prior specific could serve as the underlying data for a recommendation engine. permission and/or a fee. Sometimes it can be helpful to have a dedicated attribute designate Proceedings of the First International Workshop on Graph Data Manage- the semantic type of the vertex or edge. However, this semantic type ment, Experiences and Systems (GRADES 2013) June 23, 2013, New York, NY, USA does not imply any structural constraints, i.e., two vertices of the Copyright 2013 ACM X-XXXXX-XX-X/XX/XX ...$15.00. same type can have different attributes. In Figure 1 the type attribute ela o adighg rnatoa okod 6.Exploiting [6]. workloads transactional high handling for as well the within en apdt al oun xliigtecaatrsisof characteristics the Exploiting column. attribute table each a with to rely edges, mapped They and being vertices 2. storing Figure for in tables shown universal is on as engine, store column in-memory the to added processing. graph for and engines search with text representations physical column-oriented and row- the end, that To [12]. heteroge- applications tremendously business of a spectrum for neous capabilities processing and of storage amounts multi-core large and leverages memory it main developments, hardware recent the as processes business analytical complex supporting for designed the of core the at Situated THE IN PROCESSING GRAPH 2. work. future of 6 sketch Section a Finally, with SynopSys. paper tackle project the to research concludes us our for of matching context illus- pattern the which graph in 5, in Section challenges by main followed the is trates This templates and contribution. matching main pattern our presents of as 4 means Section by summarization. analytics graph graph analyze of large we field 3 the Section in In work support. related processing graph available the of introduces shortly section lowing the within functionality cessing graph the in embodied knowledge uncover help classic topology. they the that providing way to a addition in Therefore, mak- upon. decision rely meaningful for to time extract reasonable ers to in graphs need large applications chal- from information a Business is data graph-structured task. of lenging amounts the large through of indicated simplicity. sense is of Making sake it the rather for explicit, colors made different not is vertices of functionality processing graph the of Integration 2: Figure HANA HANA OLAP

h rp aapoesn aaiiishv nyrcnl been recently only have capabilities processing data graph The pro- graph upcoming and existing the describe we paper this In Relational Stack A AADATABASE HANA SAP Abstraction ucinlt,gahsmaissol epoue nsuch in produced be should summaries graph functionality, aaaei ni-eoyrltoa aaaesse htwas that system database relational in-memory an is database aaaefdrtsahbi eainlegn uprigboth supporting engine relational hybrid a federates database Relational Compiler Runtime Layer SQL SQL ounSoeOperators Store Column A HANA SAP A HANA SAP ODBC oeClm tr Primitives Store Column Core / aaae[11]. database JDBC aaaeadaetgtyitgae nothe into integrated tightly are and database A HANA SAP Compiler Runtime WIPE WIPE CPU A HANA SAP rp btato Layer Abstraction Graph A HANA SAP opoiehigh-performance provide to s plac rdc,the product, Appliance aaaesse.Tefol- The system. database n ie noverview an gives and Function Library Graph RPC

SAP SAP cieIfrainStore Information Active (called etxadeg trbtsso odpromne eas they because performance, good show attributes edge and vertex ffeunl-sdagrtm ilb rvddi h omo a of form the in provided be will implementations algorithms parameterizable of frequently-used set of A within [11]. example procedure for stored algorithms a graph custom enables of which implementation introduced, the been has interface programming oriented isolation. of level con- required atomicity, the guarantees and system durability, the sistency, that so context, transaction domain-specific a other to the Similar of languages system. appli- database the the between roundtrips and several cation statement, for single need a the within reducing combined thereby be to operations of complex help operator.ple the plan with execution out database carried dedicated are a latter The traversals. path of efficient combination edges, the and allows vertices deleting it and updating, creating, for operations manipu- and query language graph lation domain-specific declarative the for port capabilities. new these from tremendously graph-structured benefit but with can dealing information applications; applications business business of existing the types envision also novel we completely a foundation of into powerful it creation this convert With to having and format. without queried different place be same can the data that in such manipulated the way, graph a Thus, with in extended functionality is engine. processing toolkit relational processing infrastructure the data the relational with leverage established can it we integrate it, efficiently of of top and on instead layer core new database a the creating pro- within graph directly offering By functionality reorganizations. cessing physical any imply not do removing and adding for operations the layout, data columnar the grgtoso h rgnlntok h uhr oee w kinds two foresee authors The network. original the of networkaggregations aggregate an of notion the edges) define on they being attributes attributes), dimensions vertex no the the (with (e.g., network model multidimensional as graph introduced restricted introduced. a is [15] on Cube Based Graph a called Google, model and warehousing Microsoft data with novel collaboration in time this Champaign evaluation or architecture. processing or any concepts, by specification, accompanied However, not consumption. is approach memory their reduce help can techniques tion the theoreti- form a which present graphs, in also measures aggregated they computing paper for their foundation In cal snapshot. such each in attributes sions associated the consider and time the over snapshots changing assume graph They a them. of to slice/dice and well-known drill-down, the roll-up, map and dimensions of initions the extending ( for processing approach analytical their online presented Chicago and Champaign graph and visualization graph of fields [9]. sum- the compression graph to good related finding a to also of within is that, problem maries meaningful by The are and use. of that them context information summarize particular to of large able pieces in be small encoded to extract information crucial the is use it and graphs understand to order WORK In RELATED 3. from. choose to developers application for Library Function Graph nadto ota,agahasrcinlyrwt nobject- an with layer abstraction graph a that, to addition In sup- [2], project Store Information Active ongoing the of part As nadfeetpprfo h nvriyo liosa Urbana- at Illinois of University the from paper different a In Urbana- at Illinois of Universities the from group a paper 2008 a In B T IBM nomtoa dimensions informational hs oigfo h trbtso etcsadedges and vertices of attributes the from coming those are cuboid . J . OLAP asnRsac etrte otiuetebscdef- basic the contribute they Center Research Watson .Agahcb osiue hntesto l possible all of set the then constitutes cube graph A ). A HANA SAP WIPE emnlg,adso o ata materializa- partial how show and terminology, a mlmne.I diint h basic the to addition In implemented. was aaae ttmnsaeeeue within executed are statements database, ncnrs,the contrast, In . BI OLAP lk grgto prtoswith operations aggregation -like ogah 3.Tgte with Together [3]. graphs to ) WIPE oooia dimen- topological OLAP emt multi- permits operations US DE ?nationality ?rating 32 GB black -1 -2 ? 2 ++ AVG(?rating) ?nationality “Apple rates 4.5 rates 4.1 iPhone 5”

Kleene Select Select New vertices and edges 32 GB 2 black (Match All) “Apple iPhone 5” Figure 4: Graph summarization rule for deriving the summary Figure 3: Summary graph showing the average product rating graph shown in Figure 3. of all US-american and German users. Negative identifiers in- dicate newly created vertices. 4. GRAPH ANALYTICS THROUGH SUM- MARIZATION We propose an approach to graph summarization that is related to of OLAP operations: cuboid and crossboid queries. The former the transformation of graphs using graph grammars: with the help of simply returns the aggregate network of the desired cuboid from graph patterns a user can identify the items of interest and produce the graph cube, while the latter can be seen as somewhat similar a summary of them. to a join operation between multiple different cuboids. As for the specification and evaluation of such queries, the authors do not 4.1 Pattern Matching as the Foundation propose any mechanism and again focus on partial materialization Pattern matching in graphs has a broad range of applications. In ad- techniques. dition to the well-known use cases in the fields of pattern recognition Researchers from the University of Michigan and the Nokia and artifical intelligence, there are also many business applications Research Center have proposed the SNAP operation for grouping that can benefit from such functionality. For instance, fraud detec- vertices based on user-selected attributes and pairwise relation- tion systems try to find re-occurring patterns in insured events or ships [13]. The resulting vertex groups are homogenous with respect money withdrawals that could indicate criminal activity. Another ex- to the selected attributes and their relationships. This means that if ample are purchase recommendation engines, which can use graph two groups are connected via an edge, each vertex of one group is patterns to capture the context information for a specific customer connected to some vertices of the other, whereas if the groups are and match other purchase orders that are related in some way and not connected, no vertex of one group is connected to any vertex might be of interest to that customer. of the other. In practice this behavior turns out to be quite limiting, For a graph pattern matching technology serving as the foundation because it can result in a large number of groups. Therefore, they of large graph analytics we see the following requirements: propose the k-SNAP operation as an extension, where the homogene- ity constraint for group relationships is relaxed and the user can Match Modes: With increasing data volumes it makes sense • specify the number of groups in the graph summaries. Changing to differentiate between two match modes: match-all and this number then has the same effect as the OLAP operations drill- match-any. As the names suggest, the former returns all oc- down and roll-up. They prove that the computation of the k-SNAP currences of the pattern in the graph, while the latter returns operation is NP-complete and propose heuristics to approximate it. only the one found first and stops the matching process. This Although the two proposed operations are designed to work with distinction permits the user to indicate that only the existence different edge types, additional edge attributes are not supported. of a match is of relevance and thereby helps saving computa- In a follow-up paper by a group from the University of Wisconsin- tion resources. Madison and the IBM Almaden Research Center [14], the previous Versatility: For a general graph pattern matching function- approach is improved in two ways: first, the homogeneity require- • ment for vertex attributes is relaxed by permitting the user to specify ality to be of sufficient practical use, a certain versatility is the number of partitions for numerical attributes. Second, for help- inevitable. This is especially true with regard to the supported ing users specifying a sensible number of groups for the k-SNAP predicate types. In practice, only offering value-based equal- operation, an interestingness measure for graph summaries is de- ity comparisons might quickly turn out to be a limiting factor; fined. It is based on different aspects the authors describe as diversity, relational comparisons, regular expressions, and negation (i.e., coverage, and conciseness of a summary. testing for the non-existence of some property) is also needed. Other approaches to graph summarization are mostly statistical in Regular Path Expressions: In some cases the basic building nature. They compute sets of figures (e.g., degree distributions, hop- • blocks for graph patterns (i.e., vertices, edges, and attributes) plots, and clustering coefficients), which describe the characteristics are not expressive enough, for example whenever a pattern of the graph. The approaches presented above are different: they should reflect that two vertices have to be connected via an can produce aggregated views on the graph data in various, user- arbitrary number of hops (i.e., there is a path connecting the controlled resolutions by means of OLAP-like operations. Although two vertices). This can be achieved with the help of regular much more flexible, in our opinion these approaches are still too path expressions [5], which permit the specification of regular rigid in some ways. For example, ideally a user should be able to expressions for such paths. query the property graph in Figure 1 for the average rating of a specific product differentiated by US-american and German users, In order to actually construct graph summaries, a user has to specify and the system should return the summary graph shown in Figure 3 an action to be executed once a graph pattern matches. Both the as the result. Furthermore, neither of the approaches proposes a graph pattern and the actions constitute a summarization rule. Fig- method for specifying or evaluating arbitrary graph summarization ure 4 shows such a summarization rule for deriving the summary recipes. graph depicted in Figure 3 and illustrates the required functionality beyond pattern matching. Operations for adding vertices, edges, and as ?capacities h i attributes are indicated in green color. We extend the concept of vp at = agg(as) “Phones” 5 capacity=AVG match modes to also apply to a single vertex instead of the whole h i h i pattern. By marking at most one vertex with a star symbol ?, we (a) Vertex attribute express that for each match of the rest of the pattern all matches of that vertex will be grouped to form a single result. In the absence a ep ?ratings of the star symbol, the default match mode match-any is used. By h si h i prepending an attribute name with a question mark, we define a 2 at = agg(as) rating=AVG variable of that name and bind it to all the values in such a group. Fi- vp1 h i vp2 “Apple US h i h i nally, for deriving meaningful information from the grouped values, iPhone 5” we require a set of aggregation functions to apply to such variables. (b) Edge attribute

4.2 Summarization Templates Figure 6: Summarization templates for aggregating attribute Summarization rules are a powerful tool and can quickly become values. quite complex. To reduce the hurdles for user adoption we propose to break down the summarization functionality into several simpler templates, which are described in the following. They have to be The right-hand side of Figure 5a shows an exemplary instantiation of instantiated by specifying arguments for their parameters and can the template to illustrate its use. It collects all (storage) capacities of then be combined by applying them sequentially. products in the category “Phones” into a new multiset attribute that is To distinguish the templates from ordinary summarization rules, attached to the vertex representing the category. The different vertex we introduce a slightly different notation: Template parameters are colors are used to represent vertex types consistent with Figure 1 and enclosed in angle brackets and and printed in blue color if they make up the vertex predicates vp1 and vp2. The latter furthermore constitute a part of the graphh pattern.i New vertices, edges, and filters vertices based on the identifier “5”. The edge predicate ep attributes are printed in green color. is the edge type in, the name of the source vertex attribute as is capacity (prepended with a question mark to differentiate it from a predicate and convey the notion of a variable), and finally the name Collecting Attribute Values of the target vertex attribute is capacities. For summarizing information in graphs it is required to collect the The approach for collecting edge attribute values is similar: the attribute values of a set of vertices or edges. The summarization template depicted in the left-hand side of Figure 5b only expects templates shown in Figure 5 can do exactly that. Since multiple as to be the name of an edge attribute instead of a vertex attribute. vertices or edges can have the same attribute values, using a set Then, not the matching source vertices but the edges connecting for collecting them would result in the undesired elemination of them to a matching target vertex contribute their attribute values to duplicates. On the other hand, a list would require an ordering of the new multiset attribute. vertices or edges. Therefore, we use multisets for holding attribute The example on the right-hand side of Figure 5b collects all values, indicated by the delimiter symbols and . ratings of a specific product given by German users into a new {| |} The template on the left-hand side of Figure 5a expects the vertex multiset attribute attached to that product vertex. vp1 is composed predicates vp1 and vp2, the edge predicate ep, the source vertex of the vertex type and a filter for nationality attribute; vp2 checks the attribute name as, and the target vertex attribute name at . Note that vertex identifier in addition to the type. The template instantiation the edge direction is part of the edge predicate and indicated in the does not specify an argument for the edge predicate parameter ep, so figure using half arrowheads pointing in both directions. The match that all edges will match the pattern. The source edge attribute name mode of the source vertex is match-all, meaning that for a fixed as is rating and the name of the target vertex attribute is ratings. matching vertex determined by vp2 all vertices satisfying vp1 that are connected to the former via an edge satisfying ep contribute their Scalar Aggregation attribute values to the new multiset attribute. When collecting attribute values into multisets, it is desirable to compute aggregate values, such as sums or averages. This can be done with the help of one of the two summarization templates shown in Figure 6. a a = a ?capacity capacities h si t {| s|} For aggregating a multiset of vertex attribute values, the following ? ? 5 things have to be specified for the template depicted on the left-hand in ep side of Figure 6a: the vertex predicate vp, the name of the source vp1 h i vp2 “Phones” h i h i vertex attribute as, the name of the target vertex attribute at , and the aggregation function agg. For every matched vertex, the application (a) Vertex attributes of a template instance will yield a new vertex attribute. The exemplary instantiation on the right-hand side of Figure 6a at = as ratings as {| |} ?rating computes the average (storage) capacity of all products in the cate- ? h i ? 2 gory “Phones” (see also Figure 5a). The vertex predicate vp consists ep of the vertex type (expressed through the color) and identifier “5”. vp1 h i vp2 nationality=DE “Apple h i h i iPhone 5” The name of the source vertex attribute as is capacities; it has to designate a multiset that is then passed to the aggregation function (b) Edge attributes agg – in the example this is the AVG function. The name of the target vertex attribute at is specified as capacity and can be used Figure 5: Summarization templates and example instantiations to access the computed average value after the application of the for collecting attribute values. summarization rule. collecting their (storage) capacities into a new multiset attribute. as at = as ?capacity capacities h i ea1, . . . , ean {| |} The vertex predicate vp is a regular expression, the source vertex ? h i ++ ? ++ attribute name as is capacity, and the target vertex attribute name is capacities. While no additional edge attributes ea1,..., ean have vp va1, . . . , vam “Apple*” “Apple h i h i Products” been specified, a single additional vertex attribute with the value “Apple Products” will be created. (a) Collapse vertices The template for collapsing edges is shown in the upper part of Figure 7b. Its parameters are the vertex predicates vp1 and vp2, va h si the names of the source vertex attribute vas and the target vertex vp ? attribute vat , the edge predicate ep, the edge direction, the names of h 1i the source edge attribute eas and the target edge attribute eat , and additional attributes va1,..., vam and ea1,..., ean for the vertices ep eas and edges to be created. h i h i The match mode of the source vertex is again match-all. Thus, vat = vas ea1, . . . , ean {| |} for each vertex v satisfying vp a new vertex will be created with vp h i 2 2 ++ the attributes va ,..., va and connected to it via a new edge e with h i eat = eas 1 m {| |} va1, . . . , vam h i the attributes ea1,..., ean. Then all vertices satisfying vp1 that are connected to v via an edge satisfying ep will be matched and an nationality=US edge to the newly created vertex will be added. Finally, the values of ? the source vertex attribute vas will be collected into a new multiset attribute vat of the new vertex v and the values of the source edge attribute eas will similarly be gathered into a new multiset attribute ?rating eat of the new edge e. The lower part of Figure 7b illustrates the template with an ex- ample that will collapse all US-american users and their rating of 2 ++ a specific product into a new vertex. The vertex predicate vp is ratings 1 “Apple US composed of one filter for the vertex type and another one for the iPhone 5” nationality; vp2 checks the vertex identifier and type. Since the template instantiation does not specify an argument for the edge (b) Collapse edges predicate parameter ep, all edges will match the pattern. The name of the source edge attribute ea is rating; all values in a match will be Figure 7: Summarization templates and example instantiations s collected in the new multiset edge attribute ea named ratings. The for collapsing vertices and edges. t new nationality attribute with the value “US” is the only additional vertex attribute; no additional edge attributes ea1,..., ean have been specified and the names of the source and target vertex attributes The template on the left-hand side of Figure 6b can be used for vas and vat have been omitted as well. aggregating a multiset of edge attribute values in a similar way. It expects the vertex predicates vp1 and vp2 and the edge predicate ep; 5. CHALLENGES IN GRAPH PATTERN as and at denote the names of edge attributes. The application of this summarization template results in a new MATCHING attribute for every matched edge. This is illustrated by the example The foundation of the template-based graph summarization approach on the right-hand side of Figure 6b, which computes the average for large graph analytics presented in the previous section is a pow- rating of a specific product for a user group. erful graph pattern matching mechanism. The problem of finding occurrences of a pattern within a larger graph has already been Collapsing Vertices and Edges investigated for decades [4]. However, only very few commercial The most complex summarization templates are the ones for col- products actually offer a solution to it. In this section we formulate lapsing vertices and edges shown in Figure 7. Their main purpose is the challenges that we need to tackle in order to provide an effective to introduce new vertices that represent a group of related vertices and efficient matching technology. or edges. When collapsing vertices with the help of the template depicted Fuzzy Matching on the left-hand side of Figure 7a, the user can specify the vertex Business applications having to process large amounts of data are predice vp, the source vertex attribute name as, the target vertex often confronted with erroneous information (e.g., information that attribute name at , and additional attributes va1,..., vam and ea1, contains spelling mistakes or is outright false). As a consequence, ..., ean for the vertices and edges to be created. strict pattern matching will in many cases not return satisfactory Similar to the summarization templates for collecting attribute results. However, when extended with the notion of fuzziness, match- values, the match mode of the source vertex is match-all. That ing algorithms can be made lenient towards the problems outlined means that for all vertices satisfying vp a new vertex representing above and also return valuable results that only match the given them will be created with the attributes va1,..., vam and connected pattern to some degree. to them via new edges with the attributes ea1,..., ean. Furthermore, There are several approaches to achieve this, the most obvious the values of the source vertex attribute vas will be collected into a ones stemming from the area of fuzzy text search, where the met- new multiset attribute vat of the new vertex. ric used to rank inexact matches is based on some edit distance The right-hand side of Figure 7a shows an example for collaps- defined on graphs [7]. Still, for some use cases computing such ing all Apple products into a single vertex while at the same time an edit distance with a fixed algorithm might be insufficient and a more fine-grained mechanism may be required: the most flexible 7. ACKNOWLEDGEMENTS approach would certainly permit users to specify weights indicating We would like to thank Hannes Voigt for opening new perspectives which vertices, edges, or attributes in the pattern are more important in many inspiring discussions and for his feedback on earlier ver- than others and which might even be mandatory for a match to be sions of this paper. We also express our gratitude to our fellow Ph.D. considered valid. students in Walldorf for their encouragement and support. Pattern Representation 8. REFERENCES A question open for debate is the representation of patterns. For example, a dedicated domain-specific languange based on regu- [1] K. Bollacker, . Evans, P. Paritosh, T. Sturge, and J. Taylor. lar path expressions [5] could be used. In general, a declarative Freebase: a collaboratively created for approach (i.e., specifying what to match) is preferrable over a pro- structuring human knowledge. In Proc. ACM SIGMOD, pages cedural encoding of how to perform the matching (possibly against 1247–1250, New York, NY, USA, 2008. ACM. a system-provided programming interface), because it can be better [2] C. Bornhövd, R. Kubis, W. Lehner, H. Voigt, and H. Werner. optimized. Flexible Information Management, Exploration, and Analysis The example instantiations of the templates presented in the pre- in SAP HANA. In Proc. International Conference on Data vious section are specified as property graphs themselves. On the Technologies and Applications, pages 15–28. SciTePress, right-hand side of Figure 7a a pattern for matching certain prod- 2012. ucts by name uses a regular expression. However, the data model [3] C. Chen, X. Yan, F. Zhu, J. Han, and P. S. Yu. Graph OLAP: usually only permits a limited set of types for attribute values (e.g., Towards Online Analytical Processing on Graphs. In Proc. 8th integers, floating point numbers, and strings) and does not support International Conference on Data Mining, pages 103–112, designating vertex or edge attribute with special semantics. Still, for Pisa, Italy, Dec. 2008. IEEE. a matching algorithm to differentiate between the various predicate [4] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty Years types (e.g., relational comparisons, regular expressions, and nega- of Graph Matching in Pattern Recognition. International tion), this would be required. One possible solution could be to rely Journal of Pattern Recognition and Artificial Intelligence, on a convention for the type and structure of attribute values as is 18(3):265–298, 2004. done in the Metaweb Query Language of Freebase [1]. [5] W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries. In Leveraging Existing Functionality Proc. 27th ICDE, pages 39–50, Hannover, Germany, Apr. By implementing support for graph matching as part of the graph 2011. IEEE. abstraction layer directly within the SAPHANA database, we can ben- [6] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and efit from the efficiency of the in-memory column store operations. W. Lehner. SAP HANA Database: Data Management for For example, wherever possible we should try to map the different Modern Business Applications. SIGMOD Rec., 40(4):45–51, predicate types to similar functionality in the relational engine or Jan. 2012. the text engine. Another case for re-using existing functionality are [7] X. Gao, B. Xiao, D. Tao, and X. Li. A survey of graph edit regular path expressions. They could be implemented with the help distance. Pattern Anal. Appl., 13(1):113–129, Jan. 2010. of the traversal operator that is part of the graph abstraction layer in [8] M. Karpinski and W. Rytter. Fast Parallel Algorithms for the SAPHANA database. Graph Matching Problems. Oxford Lecture Series in Exploiting Parallelism Mathematics and its Applications. Oxford University Press, May 1998. The widespread availability of multi-core CPUs and the parallel ex- [9] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph ecution support of the SAPHANA database provide the foundation summarization with bounded error. In Proc. ACM SIGMOD, for fast data processing. By carefully choosing easily parallelizable pages 419–432, New York, NY, USA, 2008. ACM. graph matching algorithms [8] we can exploit this concurrency po- [10] M. A. Rodriguez and P. Neubauer. Constructions from Dots tential and craft an efficient solution for large graph analytics. In and Lines. Bull. American Society for Information Science addition, this approach also scales in the case of very large graphs and Technology, 36(6):35–41, 2010. that are partitioned and distributed over a number of database in- stances. [11] M. Rudolf, M. Paradies, C. Bornhövd, and W. Lehner. The Graph Story of the SAP HANA Database. In BTW, LNI, 6. CONCLUSION AND OUTLOOK pages 403–420. GI, 2013. [12] V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, and In this paper we have given an overview of the current state of the C. Bornhövd. Efficient Transaction Processing in SAP HANA technology of graph data processing in the SAPHANA database. We Database: The End of a Column Store Myth. In Proc. ACM have sketched use cases to illustrate the need for pattern matching SIGMOD, pages 731–742, New York, NY, USA, 2012. ACM. in graphs and described the challenges and open questions we face. [13] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation En route to enabling large graph analytics within the SAPHANA for Graph Summarization Categories and Subject Descriptors. database as part of our research project SynopSys, we have outlined In Proc. ACM SIGMOD, pages 567–580, Vancouver, BC, how this technique can form the basis of graph summarization Canada, 2008. ACM. functionality. Finally, we have presented the basic operations needed [14] N. Zhang, Y. Tian, and J. M. Patel. Discovery-Driven Graph for effectively summarizing large graphs, captured in the form of Summarization. In Proc. 26th ICDE, pages 880–891, Long summarization templates to be instantiated and applied by the user. Beach, CA, USA, 2010. IEEE. In the future we would like to investigate whether and how the summarization functionality can be extended to support arbitrary [15] P. Zhao, X. Li, D. Xin, and J. Han. Graph Cube: On general-purpose graph transformations and how those can be effi- Warehousing and OLAP Multidimensional Networks. In Proc. ACM SIGMOD ciently implemented in the context of the SAPHANA database. , pages 853–864, Athens, Greece, 2011. ACM.