Lecture Notes in 9510

Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board David Hutchison Lancaster University, Lancaster, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Zürich, Switzerland John C. Mitchell Stanford University, Stanford, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Dortmund, Germany Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for , Saarbrücken, Germany More information about this series at http://www.springer.com/series/8637 Abdelkader Hameurlain • Josef Küng Roland Wagner • Hendrik Decker Lenka Lhotska • Sebastian Link (Eds.)

Transactions on Large-Scale Data- and Knowledge- Centered Systems XXIV

Special Issue on - and Expert-Systems Applications

123 Editors-in-Chief Abdelkader Hameurlain Roland Wagner IRIT, Paul Sabatier University FAW, University of Linz Toulouse Linz France Austria Josef Küng FAW, University of Linz Linz Austria Guest Editors Hendrik Decker é Sebastian Link Universidad Polit cnica de Valencia University of Auckland Valencia Auckland Spain New Zealand Lenka Lhotska Czech Technical University Prague Czech Republic

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-662-49213-0 ISBN 978-3-662-49214-7 (eBook) DOI 10.1007/978-3-662-49214-7

Library of Congress Control Number: 2015943846

© Springer-Verlag Berlin Heidelberg 2016 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by SpringerNature The registered company is Springer-Verlag GmbH Berlin Heidelberg Preface

The 25th International Conference on Database and Expert Systems Applications (DEXA 2014), with proceedings published in volumes 8644 and 8645 of Springer’s Lecture Notes in Computer Science, featured some outstanding keynote presentations and regular articles. As with previous editions of the conference, the program co-chairs of DEXA 2014 invited some of the authors to submit extended papers to a special issue of the Springer journal Transactions on Large-Scale Data- and Knowledge-Centered Systems (TLDKS). Following these invitations, one keynote paper and eight regular articles were submitted. Apart from the keynote paper, each submission was carefully assessed by at least two (often more) recognized experts in the respective field. In two rounds of assessment, 35 reviews were received, most of them of very good quality. In the end, six of the eight regular papers were accepted for inclusion in this special issue, in addition to a revised and extended keynote paper. The contributions in this special issue address a wide range of important contem- porary subject areas in data-centric systems and applications, including reflective modeling, big data, similarity search, large-scale data replication, bioinformatic workflows, data pricing, and data anonymization. In good DEXA tradition, all con- tributions distinguish themselves by the novelty and innovation they bring to these subject areas. The keynote paper is authored by Dirk Draheim, who, apart from teaching, leads the Information Technology Services Centre at the University of Innsbruck (Austria), and in particular the High-Performance Computing Department of that center. He is a distin- guished expert in the field of systems modeling. Theoretical work on modeling usually focuses on logical abstractions of data structures, objects, processes, and object-level language constructs. Additionally, meta-data are of increasing relevance and importance in many fields of information processing. Hence, what is also needed, yet rarely pro- vided, are means to make metadata accessible on the object level. The main contribution of the keynote paper, entitled “Reflective Constraint Writing,” with subtitle “A Sym- bolic Viewpoint of Modeling Languages,” consists in a formal treatment of how to add reflection to object-oriented constraint languages. The work presented in this paper is extensive, covering both introspective as well as manipulative data access, for a large variety of purposes, such as to make models robust against unwanted updates, to give precise semantics to existing modeling language constructs, to enable a more adequate system analysis, to assure the quality of system design, and so on. The article “PPP-Codes for Large-Scale Similarity Searching” is co-authored by David Novak and Pavel Zezula from Masaryk University, Brno, Czech Republic, and supported by the Czech Research Foundations. Their research addresses the chal- lenging problem of efficiently identifying objects in large search spaces that are similar to a given object. The contribution is a two-phase search algorithm on top of a new sophisticated data structure called PPP-Code index. Phase one computes independent rankings based on a given distance function, while phase two aggregates these rankings VI Preface to access similar objects sooner. Experiments with artificial and real-world data show that the algorithm reduces the size of output candidates by up to two orders of mag- nitude while preserving the quality of the answer. Mouhamadou Ba, Sébastien Ferré, and Mireille Ducassé from IRISA/INSA Rennes and the University of Rennes, France, co-authored the article “Solving Data Mis- matches in by Generating Data Converters.” Their research addresses the prevalent problem where different bioinformatics services with mismatches between given outputs and required inputs need to be composed into workflows. The main contribution is an automatic converter that utilizes a rule-based convertibility detection mechanism. Experiments with real-world data types and services from the bioinformatics domain yielded new composition strategies that domain experts were not made aware of by existing ad-hoc approaches. The paper entitled “A Framework for Sampling-Based XML Data Pricing” has been written by Ruiming Tang, Antoine Amarilli, Pierre Senellart, and Stephane Bressan. The authors are affiliated to the National University of Singapore, the National Sci- entific Research Centre in Paris (France), or to both. The paper presents a sharp-witted approach to determining the market value of XML data based on data samples. The price of the data depends on the degree of completeness of sampling and on the contextual quality of data. In other words, data completeness can be traded for a discount price. The paper is one of the first to reflect the growing perception of data as merchandizing objects. In fact, the importance of data pricing is very likely to increase with the expected expansion of electronic information trading. Hence, for future work in that growing trend, this paper can be expected to become a point of reference. The authors Nikolaos Nodarakis, Evaggelia Pitoura, Spyros Sioutas, Athanasios Tsakalidis, Dimitrios Tsoumakos, and Giannis Tzimas of the paper “kdANN+: A Rapid AkNN Classifier for Big Data” work at the universities of Patras, Ioannina, the Ionian University in Corfu, and the Technological Educational Institute in Patras, . They propose the use of kNN classification for multidimensional objects. Their paper reports on a novel application in the area of big data, based on the all k-nearest neighbor query method. A divide-and-conquer strategy is pursued: Data space decomposition techniques are deployed for reducing the demand of computational resources. The authors have verified the viability of their solution on experimental data sets. By increasing the dimensionality of the dataset, the total execution cost may exceed the computational power of the cluster infrastructure at hand. To cope with that, the authors propose dimensionality reduction techniques. Their results exhibit differ- ences of computation time and cost between the examined algorithms kdANN and kdANN+, with regard to space dimensionality, the granularity of space decomposition, and the number of nearest neighbors. The paper entitled “Optimizing Inter-Data-Center Large-Scale Database Parallel Replication with Workload-Driven Partitioning” is authored by Zhen Gao, Hong Min, Xiao Li, Jie Huang, Yi Jin, and An Lei. They are affiliated to various IBM labs in the USA or China, to Tongji University in Shanghai (China), or Pivotal Inc. in Beijing (China). The authors propose two algorithms in order to, firstly, partition a large number of workload-associated tables into a minimal number of point-in-time con- sistency groups that respect some latency constraints, and, secondly, to refine such partitioning by minimizing the number of transaction splits among the resulting Preface VII consistency groups. Point-in-time consistency is an important property of replicated data and a critical objective of distributed database management systems. In particular, it is relevant to the design of data repositories that need to be able to handle unexpected shut-downs, so that replica consistency can be recovered. Given the growing use of replication for handling critical big data management challenges, the potential impact of this paper is obvious. The paper entitled “Anonymization of Data Sets with NULL Values” is authored by Margareta Ciglic, Johann Eder, and Christian Koncilia, from the Alpen Adria University at Klagenfurt (Austria). The paper deals with the problem of anonymizing data with missing or unknown values. NULL values usually represent epistemic gaps, and, in the context of this paper, should not be confused with values that have been used deliberately in order to anonymize data. A solution to the problem of anonymizing data sets with unknown component values has been missing. Rather, database table rows containing NULL-valued attributes are offhandedly discarded in conventional approaches. That, however, may easily yield a disturbing loss of information, which may distort data analysis results and thus lead to faulty decision making. Thus, the paper meets an evident desirability of solutions that do not ignore NULL values, in particular in the context of large volumes of data with a big informational and structural variety, where NULL values are ubiquitous. To conclude, we would like to thank all authors for their contributions to this special issue. Also, we are grateful to all reviewers for their invaluable work in assessing the papers, thus contributing to the high quality of this collection of articles. Last, but not least, our gratitude goes to Gabriela Wagner, whose editorial assistance and handling of all the communication with the authors and the reviewers finally made this volume possible.

October 2015 Hendrik Decker Lenka Lhotska Sebastian Link Organization

Editorial Board

Reza Akbarinia Inria, France Stéphane Bressan National University of Singapore, Singapore Dagmar Auer FAW, Austria Bernd Amann LIP6 – UPMC, France Francesco Buccafurri Università Mediterranea di Reggio Calabria, Italy Qiming Chen HP-Lab, USA Mirel Cosulschi University of Craiova, Romania Dirk Draheim University of Innsbruck, Austria Johann Eder Alpen Adria University Klagenfurt, Austria Georg Gottlob Oxford University, UK Anastasios Gounaris Aristotle University of Thessaloniki, Greece Theo Härder Technical University of Kaiserslautern, Germany Andreas Herzig IRIT, Paul Sabatier University, France Dieter Kranzlmüller Ludwig-Maximilians-UniversitätMünchen, Germany Philippe Lamarre INSA Lyon, France Lenka Lhotská Czech Technical University in Prague, Czech Republic Vladimir Marik Czech Technical University in Prague, Czech Republic Franck Morvan Paul Sabatier University, IRIT, France Kjetil Nørvåg Norwegian University of Science and Technology, Norway Gultekin Ozsoyoglu Case Western Reserve University, USA Themis Palpanas Paris Descartes University, France Torben Bach Pedersen Aalborg University, Denmark Günther Pernul University of Regensburg, Germany Sherif Sakr University of New South Wales, Australia Klaus-Dieter Schewe University of Linz, Austria A. Min Tjoa Vienna University of Technology, Austria Chao Wang Oak Ridge National Laboratory, USA

External Reviewers

José Enrique Universidad Pública de Navarra, Spain Armendáriz-Iñigo Nadia Bennani INSA Lyon, France Juliette Dibie-Barthélemy AgroParisTech, France Norbert Fuhr Universität Duisburg-Essen, Germany Francesco Guerra Università degli Studi di Modena e Reggio Emilia, Italy X Organization

Yasunori Ishihara Osaka University, Japan Stefan Katzenbeisser Technische Universität Darmstadt, Germany Michal Krátký Technical University of Ostrava, Czech Republic Udo Kruschwitz University of Essex, UK Pawel Labaj University of Natural Resources and Life Sciences, Austria Chao Li Google, USA Hui Ma Victoria University of Wellington, New Zealand Riad Mokadem IRIT, Paul Sabatier University, France Francesc Munoz-Escoi Universitat Politecnica de Valencia, Spain Bozena Pajak Northwestern University, USA Giovanni Maria Sacco University of Turin, Italy Saed Sayad University of Toronto, Canada Thomas Springer TU Dresden, Germany Lena Strömbäck Swedish Meteorological and Hydrological Institute, Sweden Brian Thompson U.S. Army Research Lab, USA Theodoros Tzouramanis University of the Aegean, Greece Lucia Vadicamo National Research Council of Italy, Italy Gottfried Vossen University of Münster, Germany Qing Wang The Australian National University, Australia Lai Xu Bournemouth University, UK Contents

Reflective Constraint Writing: A Symbolic Viewpoint of Modeling Languages ...... 1 Dirk Draheim

PPP-Codes for Large-Scale Similarity Searching ...... 61 David Novak and Pavel Zezula

Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters ...... 88 Mouhamadou Ba, Sébastien Ferré, and Mireille Ducassé

A Framework for Sampling-Based XML Data Pricing ...... 116 Ruiming Tang, Antoine Amarilli, Pierre Senellart, and Stéphane Bressan kdANN+: A Rapid AkNN Classifier for Big Data...... 139 Nikolaos Nodarakis, Evaggelia Pitoura, Spyros Sioutas, Athanasios Tsakalidis, Dimitrios Tsoumakos, and Giannis Tzimas

Optimizing Inter-data-center Large-Scale Database Parallel Replication with Workload-Driven Partitioning ...... 169 Zhen Gao, Hong Min, Xiao Li, Jie Huang, Yi Jin, An Lei, Serge Bourbonnais, Miao Zheng, and Gene Fuh

Anonymization of Data Sets with NULL Values ...... 193 Margareta Ciglic, Johann Eder, and Christian Koncilia

Author Index ...... 221