
Roskilde University Department of Computer Science

Advanced Techniques for

Efficient Integrity Checking

Ph.D. Dissertation

Davide Martinenghi

Supervisor: Prof. Henning Christiansen

October 2005

Abstract

Integrity constraint checking, understood as the verification of data correctness and well-formedness conditions that must be satisfied in any state of a database, is not fully supported by current database technology. In a typical scenario, a database is required to comply with given semantic criteria (the integrity constraints) and to maintain this compliance each time data are updated. Since the introduction of the SQL2 standard, the SQL language has supported assertions, which allow one to define general data consistency requirements expressing arbitrarily complex "business rules" that may go beyond predefined constraints such as primary keys and foreign keys. General integrity constraints are, however, far from being widely available in commercial systems; in fact, their usage is commonly not encouraged, since the database management system would not be able to provide their incremental evaluation. Given the size of today's data repositories and the frequency at which updates may occur, any non-incremental approach, even for conditions whose complexity is only linear in the size of the database, may prove unfeasible in practice. Typically it is the database designer and the application programmer who take care of enforcing integrity via hand-coded pieces of programs that run either at the application level or within the DBMS (e.g., triggers). These solutions are, however, both difficult to maintain and error prone: small changes in a database schema may require subtle modifications in such programs. In this respect, database management systems need to be extended with means to verify, automatically and incrementally, that no violation of integrity is introduced by database updates.

For this purpose we develop a procedure aimed at producing incremental checks whose satisfaction guarantees data integrity. A so-called simplification procedure takes as input a set of constraints and a pattern of updates to be executed on the data and outputs a set of optimized constraints which are as incremental as possible with respect to the hypothesis that the database is initially consistent. In particular, the proposed approach allows the compilation of incremental checks at design time, thus without burdening database run time performance with expensive optimization operations. Furthermore, integrity verification may take place before the execution of the update, which means that the database will never reach illegal states and, thus, rollback as well as repair actions are virtually unneeded.

The simplification process is unavoidably bound to a function that gives an approximate measure of the cost of evaluating the simplified constraints in actual database states, and it is natural to characterize as optimal a simplification with a minimal cost. It is shown that, for any sensible cost function, no simplification procedure exists that returns optimal results in all cases. In spite of this negative result, which holds for the most general setting, important contexts can be found in which optimality can indeed always be guaranteed. Furthermore, non-optimal simplification may imply a slight loss of efficiency, but still is a great improvement with respect to non-incremental checking. Finally, we extend the applicability of simplification to a number of different contexts, such as recursive databases, concurrent database systems, data integration systems and XML document collections, and provide a performance evaluation of the proposed model.

Acknowledgements

I would like to express my first words of gratitude to my supervisor, Henning Christiansen, knowing that my thanks cannot compensate for the enormous dedication and the long hours of discussion that he devoted to my work. His patience and sharpness of mind have been invaluable means for my understanding of the subject and for the improvement of my drafts. I am also very grateful to Stefano Ceri, who let me join the database group at Politecnico di Milano for a period of six months for a fruitful and pleasurable collaboration. Among the people in Milano with whom I had insightful discussions and exchanged ideas, I would also like to thank Daniele Braga, Alessandro Campi, Marco Colombetti, Carlo Alberto Furia, Stefano Paraboschi, Alessandro Raffio, Damiano Salvi, and Paola Spoletini. I am also indebted to Hendrik Decker, with whom I spent a week of very intensive and productive work at the Instituto Tecnológico de Informática in Valencia. The elegance and precision of his writings have been a model for me during these months. Roskilde University was also a precious source of information and learning. I particularly wish to thank Torben Braüner for his lectures on modal and hybrid logics, John Gallagher for introducing me to partial evaluation, and Jørgen Villadsen for sharing his interest in paraconsistent logics and higher-order logics. I would also like to thank the members of the Program Committee of the First International Workshop on Logical Aspects and Applications of Integrity Constraints (LAAIC'05), who agreed to provide their expertise for the success of an event that is very close to my own research: Marcelo Arenas, Andrea Calì, Stefano Ceri, Henning Christiansen, Hendrik Decker, Parke Godfrey, Mohand-Said Hacid, Maurizio Lenzerini, Rainer Manthey, Rosa Meo, and Jack Minker. Special thanks are due to Amos Scisci, who, even when submerged in his activities as a man of letters, found the time to provide me with useful comments on some of my manuscripts. I am very grateful to Anders Winther, who helped me with the Danish translation of the abstract. And of course to Céline, who was the first one to believe in my project.

Dansk resumé

Checking the integrity of a database, understood as the verification of conditions for the correctness and "well-formedness" of data, is only supported to a limited extent by the database technology in use today. Typically, a database is expected to comply with given semantic conditions (the integrity constraints) and to maintain them every time the database is updated. Ever since the SQL2 standard was established in 1992, a recognized standard for defining and interacting with relational databases, the SQL language has included so-called assertions, which make it possible to describe arbitrarily complex "business rules" that go beyond the predefined kinds of constraints such as primary and foreign keys. Unfortunately, general integrity constraints are of limited use in commercial database systems. In fact, it is not uncommon for vendors to advise against using them, since the technology does not support incremental evaluation. Considering the size of today's typical data collections and the frequency with which they are updated, a non-incremental approach is in most cases not usable in practice, even for constraints that are "only" linear in the size of the database. In most cases, the database designer, in cooperation with the application programmer, must handle integrity by hand-coding program fragments that are embedded in the application program or in the database (e.g., as so-called triggers). This practice entails problems, as maintenance is cumbersome and carries a risk of programming errors: even small changes in the database schema may require subtle changes to these programs. There is thus a need for database systems to be extended with facilities for verifying, automatically and incrementally, that no update is allowed to destroy integrity. Starting from this problem, we have developed a method for automatically producing conditions aimed at incremental checking that guarantee that integrity is preserved. A procedure for so-called simplification takes as input a set of integrity constraints plus a pattern of possible updates, and it produces derived constraints that are optimized with respect to incremental evaluation, based on a hypothesis that the database is consistent from the start. It should be emphasized that the proposed technique makes it possible to construct incremental conditions at the time the database is designed, i.e., before it is put into operation, so that potentially time-consuming optimizations impose no burden while the database is in operation. Moreover, these conditions can be checked before an intended update is executed, so that the database never, not even temporarily, loses its integrity; cumbersome repairs and so-called rollbacks can thereby be avoided. Performing this simplification as well as possible is inevitably tied to a function that gives an approximate measure of the cost of subsequently evaluating the simplified constraints in concrete database states. It is natural to characterize a simplification as optimal if it minimizes the value of this function. We are able to show, for any reasonable cost function, that no simplification procedure exists that always produces optimal results. This general result does not, however, prevent us from identifying important and relevant areas in which optimal results can be guaranteed.
In those cases where a non-optimal simplification may entail a loss with respect to ideal efficiency, it is still an essential improvement over non-incremental evaluation. Finally, we investigate the application of simplification to a range of other contexts, such as recursive databases, concurrency in databases, systems for data integration and collections of XML documents, and running-time measurements are carried out to assess the proposed simplification model.

Contents

1 Introduction
  1.1 Goals and main results
  1.2 Thesis organization
      1.2.1 Origin of the chapters

2 Integrity Control in Relational and Deductive Databases
  2.1 Preliminaries
      2.1.1 Formulas
      2.1.2 Substitutions
      2.1.3 Clauses
  2.2 Deductive databases
      2.2.1 Semantics
      2.2.2 Queries and updates
      2.2.3 Query containment
  2.3 Integrity constraints
      2.3.1 Static vs. dynamic constraints
      2.3.2 Hard vs. soft constraints
      2.3.3 Classification of integrity constraints
      2.3.4 Constraint specification languages
  2.4 Integrity control
      2.4.1 Immediate vs. deferred semantics
      2.4.2 Constraint verification
      2.4.3 Checking, maintenance, and paraconsistency

3 Simplification of Integrity Constraints
  3.1 The simplification problem
      3.1.1 Weakest preconditions
      3.1.2 Definition of simplification and ideal simplification
      3.1.3 Simplification based on optimization
  3.2 Transformations of integrity constraints
      3.2.1 The language LS
      3.2.2 Generating weakest preconditions
      3.2.3 Simplification in LS
      3.2.4 On pre-tests and post-tests
  3.3 On the equivalence of ideal simplification and query containment
  3.4 Ordering and minimality
      3.4.1 Ordering and efficiency
      3.4.2 Achieving minimal theories
  3.5 Ideality and SimpLS
      3.5.1 Completeness of resolution
      3.5.2 Local minima in SimpLS
      3.5.3 Complexity
  3.6 Related work

4 Extensions
  4.1 Nested denials
      4.1.1 Examples
  4.2 Aggregates and arithmetic
      4.2.1 A syntax for aggregates and arithmetic
      4.2.2 Set vs. bag semantics
      4.2.3 Rewrite rules for aggregates and arithmetic
      4.2.4 Examples
      4.2.5 Discussion
  4.3 Recursion
      4.3.1 A simplification pattern for ordered linear recursion
      4.3.2 Examples
      4.3.3 Related work

5 Other Contexts and Applications
  5.1 Simplification and concurrency
      5.1.1 Transactions and serializability
      5.1.2 Extended transactions
      5.1.3 Locks
      5.1.4 Discussion
  5.2 Applications to data integration
      5.2.1 A framework for data integration
      5.2.2 Integrity constraints under global-as-view
      5.2.3 Integrity constraints under local-as-view
      5.2.4 Absorption of local updates
      5.2.5 Related work
  5.3 Simplified integrity checking for XML
      5.3.1 General constraints over semi-structured data
      5.3.2 Mapping to the relational data model
      5.3.3 Simplification of XML constraints
      5.3.4 Translation into XQuery
      5.3.5 Related work

6 Experimental Evaluation
  6.1 Experiments for non-recursive databases
  6.2 Experiments for recursive databases
  6.3 Experiments for XML databases

7 Conclusion

Chapter 1

Introduction

Semantic information in databases is conventionally represented in the form of integrity constraints. Integrity constraints are properties, typically depending on the nature of the application domain, that must always be satisfied for the data to be considered consistent. Besides simple forms of predefined constraints such as primary and foreign keys, real-world applications may involve nontrivial integrity requirements that capture complex data dependencies and "business logic". The need for advanced integrity verification tools is attested by the introduction of several standard constructs for integrity support in the SQL language, such as check constraints and assertions. In spite of a long recognition of the importance of such practices, which have been part of the SQL standard since 1992, today's database management systems are hardly capable of efficiently handling other than predefined constraints.

Maintaining compliance of data with respect to integrity constraints is an essential requirement, since, if some data lack integrity, then answers to queries cannot be trusted. Furthermore, once semantic properties of the data are known to hold, they can be exploited to improve query evaluation performance by means of so-called semantic query optimization. Databases, however, usually contain very large collections of data that quickly evolve over time; this makes unabridged checking at each update too time consuming a task to be feasible. Efficiency is indeed a crucial issue, because the complexity of a complete integrity check is often linear (or worse) with respect to the size of some database table, which is prohibitive in any nontrivial case. In this regard, DBMSs need to be extended with the ability to automatically verify, in an optimized way, that database updates do not introduce any violation of integrity.

To date, the common practice in database applications is still based on ad hoc techniques. The two main approaches are triggers, at the database level, and hand-coding of tests, at the application level. By their procedural nature, both methods have major disadvantages, as they are prone to errors, require advanced programming skills and have little flexibility with respect to changes in the schema of the database. This motivates the need for automated simplification methods.

In response to this, the database as well as the logic programming and artificial intelligence communities have developed a number of techniques for optimization of integrity checking. Main approaches to efficient integrity checking that have been proposed since

the early eighties include extensions of the SLD(NF) proof procedure, partial evaluation, update propagation, incremental view maintenance and several others. The approach we pursue here is the so-called simplification of integrity constraints — a principle that has been recognized for more than two decades, dating back to at least [137, 24], and then elaborated by several other authors, e.g., [100, 102, 122, 143, 147, 48, 71, 114, 155, 70]. Our work is an attempt to reconcile and generalize such ideas in a systematic way that may promote practical applications with current database management technology.

This thesis presents a general characterization of the simplification problem. On one hand, simplification means to generate a set of integrity constraints whose satisfaction implies the satisfaction of the original constraints in the updated state. The input of the procedure is a set of integrity constraints to be maintained on the database as well as an update pattern describing the kind of updates that the database can receive; the produced output is the set of simplified integrity constraints that should be checked upon reception of an update matching the given pattern. We find it important that a proposed simplification algorithm can work on parametric update patterns, not only specific updates. This means that such patterns can be simplified at design time, when only the schema exists and not yet any database state. At runtime the simplified integrity constraints can be instantiated with respect to the specific updates and tested in the actual state.

On the other hand, the main interest of the simplification process is to obtain a set of constraints that are as easy to evaluate as possible. In this sense, simplification proper is only feasible by assuming that the database conforms to the integrity constraints in the state prior to the update. In this respect, we identify as "ideal" a simplification procedure that outputs a set of integrity constraints that is minimal with respect to an ordering that represents an approximation of the cost of evaluating the constraints. Although there is no ultimate criterion that, independently of the actual database state, perfectly measures the evaluation effort, natural requirements can be imposed that should be met by any sensible ordering — in particular, that "nothing to check" is the best possible simplification one can hope for. With this assumption, it can be proved that ideal simplification is equivalent to decidability of query containment, which is known not to hold in general (query containment is not decidable, e.g., already for pure datalog without negation). In fact, ideal simplification is possible in a class of databases if and only if query containment is decidable in that class. In spite of this limitation, it can be argued that simplification procedures that are "almost ideal" can still be of practical use and certainly improve upon non-optimized integrity checking.

A number of earlier approaches to simplification produce constraints that need to be checked in the updated database. This is not completely satisfactory because, in most cases, it is possible to decide, in the current state, whether a proposed update would introduce inconsistency (i.e., if it were executed). In this way, illegal database states are completely avoided and therefore expensive rollback operations as well as repair actions become unnecessary.

Another important aspect is the expressiveness of the language at hand.
We present a framework that can handle a very general class of updates specified with a language that includes additions, deletions and changes as well as transactions and any kind of bulk operation expressible with a rule or query. The different steps of the simplification

procedure are transformations that operate on pieces of database programs formulated in datalog¬, the extension of datalog with negation. The choice of this logical language instead of the more commonly used SQL is based on its syntactic simplicity and clarity, which make it particularly well suited for the application of proof techniques. Integrity constraints are written as denial clauses, which roughly correspond to SQL queries for which an empty answer indicates consistency.

Several challenging problems are posed in a number of different contexts. The presented transformations almost directly apply to integrity checking for data integration systems, where a mediator provides a unified view over multiple, autonomous data sources. In concurrent database systems, simplification techniques need to be strengthened with locks or similar constructs, because the guarantee of consistency of a transaction may be affected by another, interleaved transaction. More in line with today's focus in database research, integrity requirements need to be enforced in semi-structured contexts. The growing importance of XML, the de facto standard for semi-structured data, together with the high availability of online content and the increasing need for data integrity in these settings, suggests that simplification techniques should be adapted to these cases.

The availability of methods for efficient integrity constraint checking constitutes an important step towards improving performance and reliability of database systems and applications. The integration of such techniques with commercial DBMSs raises interesting implementation and technology-related issues that, however, will not be the core of this thesis.
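The remark above that denial constraints roughly correspond to SQL queries with an empty answer can be made concrete with a small sketch; the emp/dept schema, the data and the constraint below are hypothetical illustrations and not part of the thesis.

    import sqlite3

    # Hypothetical constraint: the denial "<- emp(E, D) AND not dept(D)"
    # forbids employees that reference a non-existent department.  The
    # database is consistent iff the corresponding SQL query is empty.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE emp(name TEXT, dept TEXT);
        CREATE TABLE dept(name TEXT);
        INSERT INTO dept VALUES ('sales');
        INSERT INTO emp VALUES ('ann', 'sales');
        INSERT INTO emp VALUES ('bob', 'hr');
    """)
    violations = conn.execute(
        "SELECT e.name, e.dept FROM emp e "
        "WHERE NOT EXISTS (SELECT 1 FROM dept d WHERE d.name = e.dept)"
    ).fetchall()
    print("inconsistent:" if violations else "consistent", violations)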

1.1 Goals and main results

The problem of efficiently checking integrity in relational and deductive databases has been addressed by a number of heterogeneous techniques. A first objective of this work is the development of a uniform tool of sufficient generality that covers the domains of applicability of previous solutions.

• Goal 1: To develop a general framework for the symbolic simplification of integrity constraints enabling incremental integrity checking.

In order to improve the efficiency of the whole integrity checking process, the output simplifications must be minimized with respect to a cost function expressing (an approximation of) the effort needed to actually evaluate them.

• Goal 2: To identify possible cost models related to the efficiency of integrity check- ing and to study feasibility of algorithms that produce optima with respect to those cost models.

In order to achieve the above goals, notions from Hoare's logic, such as weakest preconditions, are extended to the context of deductive databases and applied by means of transformation techniques based on proof procedures, namely resolution, subsumption and factoring. The following main results were obtained.

• We introduce a framework for the simplification of integrity constraints that has the following features:

– The procedure on which it is based produces a set of integrity constraints that are a necessary and sufficient condition of consistency of the database in the updated state.

– It uses a compiled approach: the simplified integrity constraints are derived at design time. No run time needs to be spent on optimization.

– Consistency of the updated state is tested before the update is executed. Only consistency-preserving transactions will eventually be given to the database and no illegal state will ever be reached.

– The update language at hand includes tuple additions, deletions and changes; transactions and transaction patterns are also supported. In particular, any set of tuples that can be expressed as the result of a query can be added, deleted or modified.

– It supports integrity constraints referring to special forms of recursive rules (a generalization of linear right- or left-recursion).

– It includes special treatment for aggregates.

• Several criteria for measuring the quality of a simplification are proposed and studied; consequently, ideality of simplification is defined. It is shown that for any acceptable criterion, decidability of query containment and ideality of simplification are equivalent.

• Practically relevant classes are shown for which the procedure is guaranteed to return an ideal simplification.

• On top of the simplification procedure, a schedule construction policy based on transaction modification is developed that ensures serializability as well as consistency. This is achieved without using the simplifying assumption, common in the treatment of concurrent database systems, that serial schedules always produce states that are consistent with the integrity constraints.

1.2 Thesis organization

The thesis is organized as follows. Chapter 2 presents an overall perspective on the problem of integrity control and provides motivation for this work. In particular, after introducing the basic notions and notation used throughout the thesis, we describe the framework of deductive databases and present relevant related problems, including the important notion of query containment.

Chapter 3 specifies the simplification problem and defines the core components of a procedure for the simplification of integrity constraints. After identifying a subclass of datalog¬ as the initial context of study, the chapter presents a series of syntactic transformations that comply with certain syntactic and semantic requirements and that are used to compose the simplification procedure. Criteria to measure the quality of the procedure's output are then defined and its ability to produce optimal results is explored. In particular, the chapter discusses the relationship between optimal simplification and

decidability of query containment. Finally, it reviews related work in the field and emphasizes the differences and improvements of the present method with respect to existing approaches.

The limitations introduced in chapter 3 are progressively relaxed in chapter 4. Firstly, the simplification procedure is extended to the context of non-recursive datalog¬ databases. Recursion is then dealt with and suitable adjustments of the procedure are considered that provide improved results for specific recursive patterns in rules and integrity constraints. Aggregate functions fall outside datalog as well as first-order logic; however, they are a tool of prime importance in terms of expressiveness of a database language. An extension of the simplification procedure to deal with such constructs as well as with arithmetic expressions is also discussed in chapter 4.

Chapter 5 provides an overview of orthogonal aspects related to simplification as well as of other contexts of applicability of the described transformations. We put integrity checking in the perspective of concurrent database systems and investigate the relationships between simplified integrity constraints and efficient evaluation in this setting. Integrity verification can also be regarded from the point of view of a "global" logical database (the mediator) in a data integration system consisting of several local data sources. Since integrity constraints do not necessarily hold in the mediator even though they do locally at the sources, this chapter addresses this problem by adapting the methods that were applied to a single database. A structurally different context is that of databases containing hierarchically ordered data, as is the case for collections of XML documents. The chapter discusses an approach to efficient integrity checking in this setting, based on a mapping of the XML model to a (flat) relational model.

An overall assessment from an experimental viewpoint of the results presented in this thesis is given in chapter 6. We discuss implementation issues and report on the efficiency of the simplification procedure based on the comparison with previous techniques developed by other researchers.

1.2.1 Origin of the chapters

An earlier version of parts of the material contained in chapter 3 has appeared in [58]. Chapter 4 incorporates results taken from [125] and [127]. Chapter 5 includes and revisits material that has been published in [128] and [59]. The experiments on recursive databases described in chapter 6 were also published in [127]. Papers [58, 59, 128, 127] were written together with Henning Christiansen. The parts regarding integrity checking for XML in chapters 5 and 6 are an adaptation of unpublished material written in collaboration with Daniele Braga and Alessandro Campi.

Chapter 2

Integrity Control in Relational and Deductive Databases

In this chapter we present the basic concepts related to deductive databases. Section 2.1 introduces the notation used throughout the thesis. The main components of a deductive database are described in section 2.2, along with an overview of its model-based semantics and query and update languages. We refer to standard texts in logic, logic programming and databases such as [163, 45, 139] for further discussion on foundational aspects of deductive databases. We then take a closer look at integrity constraints. The meaning of integrity constraints and different kinds thereof are discussed in section 2.3, together with different languages for the specification of integrity constraints. Finally, strategies for integrity control are discussed in section 2.4.

2.1 Preliminaries

We base our discourse throughout this thesis on the framework of deductive databases and use the syntax of the datalog language [163] as a basis to formulate the main concepts. This choice is motivated by different reasons. First of all, datalog is recognized as a standard logical language that has evolved from Prolog (the most popular language for PROgramming in LOGic) into a paradigm designed for use as a database language. Among the distinctive features of datalog with respect to Prolog we mention:

• set-at-a-time processing (as opposed to Prolog's tuple-at-a-time), which is more in line with the fact that queries over a database should return a set of tuples, rather than individual tuples.

• declarativity: database languages are nonprocedural, i.e., the execution of queries does not depend on the order of retrieval of tuples. The procedure for accessing data is left as a system task and is not specified at the language level.

• absence of function symbols, which are typically used to model complex data structures and objects, and thus may be of little relevance in the context of (flat) deductive/relational databases.

These aspects, together with datalog's syntactic simplicity, make it a suitable choice as a database language. Our preference is also determined by the direct applicability of proof techniques and transformations to datalog rules, which would be much more problematic with SQL, although, since the standard SQL:1999, SQL and datalog have a very similar expressive power¹. A further advantage of datalog with respect to other database languages is that it has a well-understood declarative semantics. This makes it easier to reason about queries and answers.

2.1.1 Formulas

We assume an alphabet including infinite sets of symbols for predicates, constants, and variables. As a notational convention, we generally use lowercase letters to denote predicates (p, q, . . .) and constants (a, b, . . .) and uppercase letters (X, Y, . . .) to denote variables. Each predicate has an associated nonnegative arity and the notation p/n indicates that predicate p has arity n. A term is either a variable or a constant²; conventionally, terms are also denoted by lowercase letters (t, s, . . .) and sequences of terms are indicated by vector notation, e.g., ~t. Besides terms and predicates, the alphabet includes:

• Logical connectives (¬, ∧, ∨, ←), including the 0-ary connectives true and false.

• Quantifiers (∀, ∃).

• Parentheses.

• Commas.

Definition 2.1.1 (Formula) The set F of well-formed formulas is the smallest set such that:

• true ∈ F and false ∈ F;

• if p/n is a predicate and t1, . . . , tn are terms then p(t1, . . . , tn) ∈ F;

• if both F ∈ F and G ∈ F then (¬F) ∈ F, (F ∧ G) ∈ F, (F ∨ G) ∈ F, (F ← G) ∈ F;

• if F ∈ F and X is a variable then (∀X F) ∈ F and (∃X F) ∈ F.

In (∀X F) and in (∃X F), the formula F is called the scope of the quantifier; any occurrence of X in F is said to be bound; it is bound by the quantifier of smallest scope that causes it to be bound; a variable which is not bound in F is said to occur free in F. A formula in which all variables are bound is said to be closed.

¹ In particular, the extension of datalog with negation, known as datalog¬, allows the expression of multi-linear recursive rules, which are not (yet) allowed in the SQL standard; on the other hand, SQL includes additional concepts, such as NULL values, aggregates and arithmetic operators, and has an underlying bag semantics.

² The described language, as customary for databases, does not contain any (non-nullary) function symbols.

Whenever we have a sequence of variables X~ = ⟨X1, . . . , Xn⟩, the notation ∀X~ F is taken as an abbreviation for ∀X1(∀X2(· · · (∀Xn F) · · · )); similarly for ∃X~. For convenience, in the following, free variables are indicated in boldface (a, b, . . .) and referred to as parameters; the other variables, unless explicitly quantified existentially, are assumed to be universally quantified. An expression containing parameters is called parametric.

Formulas of the form p(t1, . . . , tn), where p is a predicate of arity n and the ti's are terms, are called atoms. An atom preceded by a ¬ symbol is a negated atom; a literal is either an atom or a negated atom. In the following, we will only consider formulas that are well-formed and will omit parentheses whenever possible by assuming that the connectives have the following binding order: ¬ (strongest), ∧, ∨, ← (weakest). For example, the formula p ∧ q ∨ r is read as (p ∧ q) ∨ r.

Predicates are divided into three pairwise disjoint sets: intensional, extensional, and built-in predicates. Intensional and extensional predicates are collectively called database predicates; atoms and literals are classified similarly according to their predicate symbol. There is one built-in binary predicate for term equality (=), written using infix notation; t1 ≠ t2 is a shorthand for ¬(t1 = t2) for any two terms t1, t2.

2.1.2 Substitutions

A substitution is a mapping from bound variables to terms. A substitution σ is also written as {X~/~t} to indicate that the variables in X~ are orderly mapped to the terms in ~t; the notation dom(σ) refers to the set of variables in X~. Whenever E is a term (resp. formula) and σ is a substitution {X~/~t}, the notation Eσ denotes the term (resp. formula) that arises from E when each occurrence of a variable in X~ is simultaneously replaced by the corresponding term in ~t; Eσ is called an instance of E. A formula or term which contains no variables is called ground. A substitution {X~/Y~} is called a renaming iff Y~ is a permutation of X~. Formulas F, G are variants of one another if F = Gρ for some renaming ρ. A substitution σ is said to be more general than a substitution θ iff there exists a substitution η such that θ = ση. A unifier of terms (resp. formulas) t1, . . . , tn is a substitution σ such that t1σ = · · · = tnσ; σ is a most general unifier (mgu) of t1, . . . , tn if it is more general than any other unifier of these terms (resp. formulas). Formulas that are equal modulo renaming or that differ only in the order of operands of commutative and associative connectives are considered identical; for any formula F, ¬¬F and F are also considered identical.
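The following sketch is only an illustration of these notions (it is not part of the thesis): it applies substitutions and computes a most general unifier for the function-free atoms of this chapter, under the assumed encoding that terms are strings and variables start with an uppercase letter.

    def is_var(t):
        """Terms are strings; variables start with an uppercase letter."""
        return t[0].isupper()

    def apply_subst(subst, atom):
        """Apply a substitution {variable: term} to an atom (pred, args)."""
        pred, args = atom
        return (pred, tuple(subst.get(a, a) if is_var(a) else a for a in args))

    def mgu(atom1, atom2):
        """Return a most general unifier of two atoms, or None if none exists."""
        (p1, args1), (p2, args2) = atom1, atom2
        if p1 != p2 or len(args1) != len(args2):
            return None
        subst = {}
        for s, t in zip(args1, args2):
            s = subst.get(s, s) if is_var(s) else s
            t = subst.get(t, t) if is_var(t) else t
            if s == t:
                continue
            if is_var(s):
                subst = {v: (t if w == s else w) for v, w in subst.items()}
                subst[s] = t
            elif is_var(t):
                subst = {v: (s if w == t else w) for v, w in subst.items()}
                subst[t] = s
            else:
                return None          # two distinct constants cannot unify
        return subst

    print(apply_subst({"X": "a"}, ("r", ("X", "b"))))   # ('r', ('a', 'b'))
    print(mgu(("r", ("X", "b")), ("r", ("a", "Y"))))    # {'X': 'a', 'Y': 'b'}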

2.1.3 Clauses

A distinctive feature of deductive databases is the ability to express implicit information in terms of deductive units called clauses. A clause is usually defined as a disjunction of literals L1 ∨ · · · ∨ Ln; as mentioned, all the variables in a clause, unless indicated as parameters, are implicitly universally quantified at the outermost level. Conventionally, clauses defined in this way can be represented as a set of literals {L1, . . . , Ln}. Clauses with at most one positive literal are called Horn clauses; a clause with no literals is called the empty clause. In the context of deductive databases, however, it is more common to express clauses in an equivalent implicational form (using ←) that represents inferential knowledge of the

type "if some premises hold, then a given consequence also holds".

Definition 2.1.2 A clause is a formula of the form

A ← L1 ∧ · · · ∧ Ln

where A is an atom and L1, . . . , Ln are literals; A is called the head and L1 ∧ · · · ∧ Ln the body of the clause, and if Li = ¬Ai for some atom Ai, Ai is said to occur negatively. The head is optional and when it is omitted the clause is called a denial.

A rule is a clause whose head is intensional, and a fact is a clause whose head is extensional and ground and whose body is empty (understood as true). Two clauses are said to be standardized apart if they have no bound variable in common. Denials can be read as clauses in which the omitted head represents the 0-ary connective false. The intuition behind a denial is that its body indicates a condition that must not hold. The empty clause is a denial with an empty body, which is conventionally indicated with the 0-ary connective false. For convenience, in the transformations introduced in chapter 3, denials that are recognized to be tautological will be replaced by the 0-ary connective true, although, strictly speaking, this is not a denial.

The presence of negation or built-in predicates (which in most cases correspond to infinite relations) may compromise the so-called domain independence. Consider, e.g., the rule q(X) ← ¬p(X); this indicates that q contains all values that are not in p. The application of such a rule would, thus, require a complete search on the domain of values that may apply to p, which in most cases is extremely inefficient. The common measure used to avoid such behavior is to introduce a syntactic property that forces clauses to be range restricted³, as defined below.

Definition 2.1.3 (Range restriction) A bound variable in a clause is range bound if it appears in a positive database literal in the body. A clause is range restricted if all bound variables in it are range bound.
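A small illustrative sketch of the range-restriction test follows; the clause encoding is an assumption made here (the thesis does not prescribe one), with each body literal carrying its sign and a flag saying whether it is a database literal, so that built-ins such as equality do not make a variable range bound.

    def is_var(t):
        return t[0].isupper()

    def range_restricted(head, body):
        """head is (pred, args) or None for a denial; body literals are
        (positive, pred, args, is_database_literal) tuples."""
        range_bound = {a for positive, _, args, is_db in body if positive and is_db
                         for a in args if is_var(a)}
        used = {a for _, _, args, _ in body for a in args if is_var(a)}
        if head is not None:
            used |= {a for a in head[1] if is_var(a)}
        return used <= range_bound

    # q(X) <- not p(X) is not range restricted: X only occurs in a negative literal.
    print(range_restricted(("q", ("X",)), [(False, "p", ("X",), True)]))   # False
    # q(X) <- r(X) and not p(X) is range restricted.
    print(range_restricted(("q", ("X",)),
                           [(True, "r", ("X",), True),
                            (False, "p", ("X",), True)]))                  # True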

In the remainder of the thesis we will only consider range restricted clauses. To simplify the notation, we introduce the pred operator. For a literal L, pred(L) refers to the predicate of L. For a clause C with non-empty head A, pred(C) refers to pred(A); for a set of non-headless clauses C, pred(C) refers to the set of predicates {pred(C) | C ∈ C}.

2.2 Deductive databases

Deductive databases are characterized by three components: facts, rules and integrity constraints. An integrity constraint can, in general, be any (closed) formula. In the context of deductive databases, however, it is customary to express integrity constraints in some canonical form; we adopt here the denial form, which gives a clear indication of what must not occur in the database. We return to the expressiveness of integrity constraints in section 4.1.

³ Sometimes also indicated as safe or allowed clauses.

Definition 2.2.1 (Schema, database) A database schema S is a pair ⟨IDB, IC⟩, where IDB (the intensional database) is a finite set of range restricted rules, and IC (the constraint theory) is a finite set of integrity constraints. The predicates in pred(IDB) are said to be defined by IDB and it is assumed that any intensional predicate occurring in S is defined by IDB. A database state (or, simply, database) D on schema S is a pair ⟨IDB, EDB⟩, where EDB (the extensional database) is a finite set of facts; D is said to be based on IDB.

When the IDB is understood, the database may be identified with EDB and the schema with IC.

Definition 2.2.2 (Language) Let S be the set of all schemata. Any set L ⊆ S is called a database language.

We observe here that, from the point of view of formulas, a set of clauses (rules, facts or integrity constraints) is logically intended as the conjunction of its elements; under this assumption, an empty set of clauses will interchangeably be indicated as ∅ or as true. Parameters are not expected to be part of any actual database or integrity constraint, but the transformations that will be introduced in chapter 3 may generate parametric versions of these categories. These transformations may introduce new auxiliary intensional predicates in IC. This means that such new IC components are only meaningful as part of a new schema with an extended IDB part. In order to state that such schema modifications do not interfere with existing definitions, we define the following notion of compatible schemata.

Definition 2.2.3 Two schemata S1 = ⟨IDB1, IC1⟩ and S2 = ⟨IDB2, IC2⟩ are compatible whenever pred(IDB1 \ IDB2) ∩ pred(IDB2) = ∅ and pred(IDB2 \ IDB1) ∩ pred(IDB1) = ∅. They are disjoint if pred(IDB1) ∩ pred(IDB2) = ∅.

Trivially, disjoint schemata are compatible. As a convention, we write S1 ∪ S2 as a shorthand for ⟨IDB1 ∪ IDB2, IC1 ∪ IC2⟩ whenever S1 and S2 are compatible.

2.2.1 Semantics

We now introduce the semantics of deductive databases based on Herbrand models. For further details we refer to [121]. When reasoning about the models of a database, it is customary to restrict the alphabet to exactly those symbols that occur in the database. It is also assumed that there is at least one constant, since, otherwise, the domain represented by the database would be empty.

Definition 2.2.4 (Herbrand base) The Herbrand universe UD of a database D is the set of constants in D (plus one constant c, if D contains no constants). The Herbrand base HD of D is the set of all ground atoms that can be constructed from the predicate symbols in D and constants in UD. We note that, since we are in a function-free setting, Herbrand bases will always be finite.

Definition 2.2.5 (Herbrand interpretation) Let D be a database and HD its Herbrand base. Any subset I of HD is a (Herbrand) interpretation of D.

Definition 2.2.6 Let I be an interpretation for a database D. A closed formula F is true in I, written |=I F, according to the following inductive definition (a small executable rendering of these clauses is sketched after the definition). We write ⊭I F as a shorthand to indicate that it is not the case that |=I F.

• For any I, |=I true and ⊭I false.

• For a ground database atom A, |=I A iff A ∈ I.

• For any constant c, |=I c = c.

• For any closed formulas F, F1, F2,

  |=I ¬F iff ⊭I F;

  |=I F1 ∧ F2 iff both |=I F1 and |=I F2;

  |=I F1 ∨ F2 iff |=I F1 or |=I F2;

  |=I F1 ← F2 iff |=I F1 or ⊭I F2;

  |=I ∀X F iff, for all constants c ∈ UD, |=I F{X/c};

  |=I ∃X F iff, for some constant c ∈ UD, |=I F{X/c}.
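The executable sketch announced above renders the clauses of Definition 2.2.6 directly; the tuple encoding of formulas and the explicit universe argument are assumptions of the sketch, not part of the thesis.

    def substitute(formula, var, const):
        """Replace every occurrence of a variable by a constant."""
        if isinstance(formula, tuple):
            return tuple(substitute(f, var, const) for f in formula)
        return const if formula == var else formula

    def holds(formula, interp, universe):
        """Truth of a closed formula in a Herbrand interpretation."""
        op = formula[0]
        if op == "true":  return True
        if op == "false": return False
        if op == "atom":  return formula[1] in interp            # ground atom
        if op == "eq":    return formula[1] == formula[2]
        if op == "not":   return not holds(formula[1], interp, universe)
        if op == "and":   return all(holds(f, interp, universe) for f in formula[1:])
        if op == "or":    return any(holds(f, interp, universe) for f in formula[1:])
        if op == "impl":  # F1 <- F2
            return holds(formula[1], interp, universe) or not holds(formula[2], interp, universe)
        if op in ("forall", "exists"):
            var, body = formula[1], formula[2]
            values = (holds(substitute(body, var, c), interp, universe) for c in universe)
            return all(values) if op == "forall" else any(values)
        raise ValueError(f"unknown connective: {op}")

    # I = {p(a)} over the universe {a, b}: "exists X p(X)" holds, "forall X p(X)" does not.
    I, U = {("p", "a")}, ["a", "b"]
    print(holds(("exists", "X", ("atom", ("p", "X"))), I, U))   # True
    print(holds(("forall", "X", ("atom", ("p", "X"))), I, U))   # False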

Definition 2.2.7 (Herbrand model) An interpretation I of a database D = ⟨IDB, EDB⟩ is a (Herbrand) model of D if |=I C for every clause C ∈ IDB ∪ EDB. A model of D is minimal if none of its proper subsets is a model of D.

The notion of Herbrand model can be defined in a similar way for any arbitrary closed formula. In particular, we say that a closed formula F is a logical consequence of a closed formula G, written G |= F, iff F is true in every model of G; F ≡ G means that both F |= G and G |= F hold. Generally, a database does not necessarily have a unique minimal model. However, there are database classes for which the existence of a unique intended minimal Herbrand model is guaranteed. In order to identify such classes we introduce the notions of dependency graph and predicate dependencies.

Definition 2.2.8 (Dependency graph) Given a set of clauses C, the dependency graph DC of C is a directed graph with labelled arcs. Its nodes are the predicates occurring in C. For every clause in C with predicate p in the head and predicate q in a positive (resp. negative) literal in the body there is an arc q ↪C p with label "+" (resp. "−"). No other arc is in DC. The notation q ⇢C p indicates that there is a nonempty path from q to p in the dependency graph DC of C, and in this case we say that p depends on q (in C); if there is at least one "−" label in some path from q to p, then p depends negatively on q, indicated q ⇢⁻C p. For a formula F, we say that F depends (negatively) on q if F contains a predicate that depends (negatively) on q.
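As an illustration of Definition 2.2.8, the following sketch (hypothetical encoding, not from the thesis) builds the labelled dependency graph of a small made-up IDB and tests the "depends (negatively) on" relation.

    def dependency_graph(clauses):
        """clauses: (head_predicate, [(positive, body_predicate), ...]).
        Returns labelled arcs (body_predicate, head_predicate, label)."""
        return {(q, p, "+" if positive else "-")
                for p, body in clauses for positive, q in body}

    def depends(graph, q, p, negatively=False):
        """True iff there is a nonempty path from q to p (containing at
        least one '-' arc when negatively=True)."""
        frontier, visited = [(q, False)], set()
        while frontier:
            node, seen_neg = frontier.pop()
            for src, dst, label in graph:
                if src != node:
                    continue
                seen_neg2 = seen_neg or label == "-"
                if dst == p and (seen_neg2 or not negatively):
                    return True
                if (dst, seen_neg2) not in visited:
                    visited.add((dst, seen_neg2))
                    frontier.append((dst, seen_neg2))
        return False

    idb = [("p", [(True, "e"), (False, "q")]),   # p <- e and not q
           ("q", [(True, "r")])]                 # q <- r
    g = dependency_graph(idb)
    print(depends(g, "r", "p"))                    # True:  r ⇢ p
    print(depends(g, "q", "p", negatively=True))   # True:  q ⇢⁻ p
    print(depends(g, "e", "p", negatively=True))   # False: only a "+" path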

Recursion is one of the distinctive features of deductive databases that is captured by the dependency graph. In fact, databases can be classified according to the dependency graph of their IDB.

Definition 2.2.9 (Recursion) A predicate p is recursive in a set of clauses C iff p depends on p in C. Two predicates p and q are mutually recursive in C whenever both p depends on q and q on p. A rule is recursive whenever there exists a literal in the rule body whose predicate is mutually recursive with the head predicate; if there is only one such literal, the rule is linear; if there are exactly two such literals, the rule is bilinear. A set of rules is linear if every recursive rule in it is linear; a set of rules is bilinear if every recursive rule in it is linear or bilinear and there is at least one bilinear rule.

Definition 2.2.10 Let D = ⟨IDB, EDB⟩ be a database. Then IDB and D are called

• positive if there are no predicates p, q such that q ⇢⁻IDB p;

• semi-positive if there are no predicates p, q such that q ⇢⁻IDB p and q ∈ pred(IDB);

• hierarchical if there is no predicate p such that p ⇢IDB p;

• stratified if there is no predicate p such that p ⇢⁻IDB p.

A schema S = ⟨IDB, IC⟩ is positive if IDB is positive and no negation symbol occurs in IC. S is semi-positive if IDB is semi-positive and negation symbols in IC only occur immediately before extensional predicates. S is hierarchical (stratified) if IDB is hierarchical (stratified).

Problems arise when trying to soundly derive negative information from a database. A standard strategy for deriving negative knowledge from a positive database is to assume that one can draw negative conclusions based on the lack of positive information, known as the closed world assumption (CWA) [145]. Given a set of clauses C of a positive database, this is expressed as the meta-rule "if A cannot be proved from C, then ¬A is assumed to hold", provided that A is a ground literal. This definition is stated proof-theoretically; however, in case of a sound and complete proof system, it can be replaced by "if A is not a logical consequence of C, then ¬A is assumed to hold". A first problem with this approach is that non-provability is in general undecidable, even for positive databases, thus it is not possible to determine whether the rule is applicable or not. However, a weaker version of the CWA, known as negation as finite failure (NaF), makes this rule decidable. Furthermore, the inferences derived via the CWA may not be logical consequences of the database: for example, the Herbrand base (in which all ground atoms are true) is always a model, albeit not minimal, of a positive database D, and thus no negative literal is a logical consequence of D. The solution initially proposed by Clark [60] was to discard "uninteresting" models from consideration. Intuitively, this is done by completing the inferential knowledge contained in a database in clausal form by adding the only-if part of the rules. Technically, the clauses in IDB ∪ EDB are transformed according to the following steps:

• Transform each clause of the form p(~t) ← B into p(X~) ← ∃Y~ (X~ = ~t ∧ B), where Y~ are the variables in the original clause and X~ are new variables.

• The formula ∀X~ (p(X~) ↔ F1 ∨ · · · ∨ Fn) replaces all the formulas p(X~) ← F1, . . . , p(X~) ← Fn obtained in the previous step⁴.

• The formula ∀X~ (q(X~) ↔ false) is added for each database predicate symbol not occurring in the head of a clause in the original database.

• Finally, the following free equality axioms, adapted for our function-free setting, are added to define the equalities introduced in the first step (see, e.g., [139] for a formal account including axioms that are needed in the presence of function symbols).

  – ∀X (X = X)
  – ∀X, Y (X = Y ← Y = X)
  – ∀X, Y, Z (X = Z ← X = Y ∧ Y = Z)
  – ∀X~, Y~ (p(X~) = p(Y~) ← X~ = Y~) for any predicate p.
  – c1 ≠ c2 for any two different constants c1, c2.

The resulting formula, called the completion of a database D = ⟨IDB, EDB⟩, is indicated comp(D).

⁴ A ↔ B is a shorthand for (A ← B) ∧ (B ← A).

Example 2.2.11 Consider the database

D = ⟨{p(X, Y) ← e(X, Y), p(X, Y) ← e(X, Z) ∧ p(Z, Y)}, ∅⟩.

Its completion comp(D) is

∀X1, Y1 (p(X1, Y1) ↔ ∃X, Y (X1 = X ∧ Y1 = Y ∧ e(X, Y)) ∨ ∃X, Y, Z (X1 = X ∧ Y1 = Y ∧ e(X, Z) ∧ p(Z, Y))).

Completion provides a logical basis for NaF, but has problems of its own once we allow negations to occur in clauses. For example, the clause p ← ¬p is completed as p ↔ ¬p, which is false in all interpretations. Such situations are completely avoided in the class of stratified databases, which is the most general of the four classes of definition 2.2.10 and prevents mixing negation and recursion. Stratified databases admit a unique "intended" minimal Herbrand model, called the standard model. In order to show the construction of the standard model of a stratified database [8], we need to introduce the immediate consequence operator.

Definition 2.2.12 Let I be a Herbrand interpretation for a database D = ⟨IDB, EDB⟩ and let gr(D) indicate the set of ground instances of clauses in EDB ∪ IDB. The immediate consequence operator TD is defined as follows.

TD(I) = {A | A ← B ∈ gr(D) and |=I B}.

The notation TD^ω(I) denotes the limit of the sequence

TD^0(I) = I, . . . , TD^(n+1)(I) = TD(TD^n(I)) ∪ TD^n(I).

A stratified database can be partitioned into layers, called strata, according to the following definition. To simplify notation, for two databases D1 = ⟨IDB1, EDB1⟩ and D2 = ⟨IDB2, EDB2⟩, we indicate with D1 ∪ D2 the database ⟨IDB1 ∪ IDB2, EDB1 ∪ EDB2⟩.

14 Definition 2.2.13 A database D has a stratification D1 ∪· · ·∪Dn = D if, for 1 ≤ i ≤ n, in the body of clauses in Di:

• only D's extensional predicates or predicates in D1 ∪ · · · ∪ Di occur positively;

• only D's extensional predicates or predicates in D1 ∪ · · · ∪ Di−1 occur negatively.

It turns out that a program is stratified iff it admits a stratification. We now have all the elements to compute the standard model.

Definition 2.2.14 Let D be a stratified database and D1 ∪ · · · ∪ Dn a stratification for D. Consider the sequence M1 = TD1^ω(∅), . . . , Mn = TDn^ω(Mn−1) = MD. MD is called the standard model of D.

In [8] it was shown that the standard model is a minimal Herbrand model and does not depend on the stratification. Furthermore, the standard model is supported.

Definition 2.2.15 An interpretation I of a database D is said to be supported if, for any ground atom A, A ∈ I implies that there is a clause A ← B ∈ gr(D) such that |=I B.

Example 2.2.16 Consider the sets of rules IDB1 = {p ← ¬q, q ← r} and IDB2 = {p ← ¬p}. Any database based on IDB1 is stratified, whereas any database based on IDB2 is not stratified. The standard model of the database ⟨IDB1, ∅⟩ is {p}; another minimal, but not standard, model of this database is {q}.
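To connect Definitions 2.2.12 to 2.2.14 with Example 2.2.16, here is a minimal sketch, not the thesis' algorithm, that computes the standard model of ⟨IDB1, ∅⟩ by iterating the immediate consequence operator stratum by stratum; the rule encoding and the hand-supplied stratification are assumptions of the sketch.

    def t_omega(rules, interpretation):
        """Iterate the immediate consequence operator to its fixpoint.
        Rules are (head, positive_body_atoms, negative_body_atoms)."""
        model = set(interpretation)
        changed = True
        while changed:
            changed = False
            for head, pos, neg in rules:
                if all(a in model for a in pos) and all(a not in model for a in neg):
                    if head not in model:
                        model.add(head)
                        changed = True
        return model

    strata = [
        [("q", ["r"], [])],    # stratum 1: q <- r
        [("p", [], ["q"])],    # stratum 2: p <- not q
    ]
    model = set()              # empty extensional database
    for stratum in strata:
        model = t_omega(stratum, model)
    print(model)               # {'p'}: the standard model of ⟨IDB1, ∅⟩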

It was also shown by Apt, Blair and Walker [8] that the completion of a stratified program always admits a model. Determining whether a database is stratified is decidable, but the problem of determining whether the completion of a database admits a model is undecidable. Furthermore, it is known that the class of possibly non-stratified databases is strictly more expressive than that of stratified databases [111]. Other approaches to the calculation of a minimal model are known in the literature, also for slightly more general classes than stratified databases, e.g., the perfect model for locally stratified deductive databases [142], the stable model semantics [87] and the well-founded model semantics [165]. For a survey on these model-theoretic issues we refer to [9].

In the following, we will focus on stratified databases. We define database semantics in terms of the standard model. The notation D |= φ, where D is a (stratified) database and φ is a closed formula, indicates that φ holds in D's standard model MD, i.e., |=MD φ. For any schemata S1 = ⟨IDB1, IC1⟩ and S2 = ⟨IDB2, IC2⟩ defined on the same extensional predicates, we write S1 ≡ S2 to indicate that their constraint theories evaluate in "equivalent ways", i.e., for every extensional database EDB, ⟨IDB1, EDB⟩ |= IC1 iff ⟨IDB2, EDB⟩ |= IC2. According to this semantics, the negation symbol (¬) used throughout this thesis indicates what is commonly referred to as default negation, since it is essentially defined by assuming ¬A "by default", that is, in the absence of sufficient evidence to the contrary⁵. Specifically, here ¬A is assumed if A is false in the standard model of the database.

⁵ Default negation is sometimes written as not to distinguish it from logical negation. In this thesis, the ¬ symbol will always refer to default negation; when needed, we will use the symbol ¬` to indicate logical negation.

In general, we will express our definitions and operators on schemata, so that integrity constraints are always in the context of an IDB; however, when the IDB is understood, the schema may be identified with the integrity constraints and the database with the extensional database.

2.2.2 Queries and updates

We now introduce a rule-based language for the modification of the relations of a deductive database.

Definition 2.2.17 (Query) A query for a database D = ⟨IDB, EDB⟩ is an expression of the form ⇐ A where A is an atom such that pred(A) ∈ pred(IDB).

For convenience, we include queries in intensional predicates, i.e., if ⇐ p(X~) is a query for a database D then p is defined in D's IDB. When no ambiguity arises, however, a given query may be indicated directly by means of its defining formula (instead of the predicate name).

Definition 2.2.18 (Defining formula) The disjunctive predicate definition for an intensional predicate p in a database schema S is the set of all rules H ← B in S such that pred(H) = p. Without loss of generality, we assume that they are expressed as follows:

{ p(X~) ← B1,
  . . .
  p(X~) ← Bn },

where X~ is a sequence of distinct variables, called distinguished variables, and the Bi's are conjunctions of literals. Variables that are not distinguished are called non-distinguished variables. The defining formula of p is

B1ρ1 ∨ ... ∨ Bnρn, where each ρi is a renaming giving fresh new names to the non-distinguished variables of Bi to avoid name clashes. The disjunctive predicate definition is also written as

p(X~) ← B1ρ1 ∨ · · · ∨ Bnρn.
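As a small illustration of Definition 2.2.18, the following sketch (hypothetical encoding) computes a defining formula by renaming the non-distinguished variables of each rule body apart; representing bodies as plain strings is an assumption made for brevity.

    import re

    def defining_formula(distinguished, rule_bodies):
        """Rename the non-distinguished variables of each body apart and
        join the renamed bodies into a single disjunction."""
        disjuncts = []
        for i, body in enumerate(rule_bodies):
            variables = set(re.findall(r"\b[A-Z]\w*", body))
            renaming = {v: f"{v}_{i}" for v in variables if v not in distinguished}
            for old, new in renaming.items():
                body = re.sub(rf"\b{old}\b", new, body)
            disjuncts.append(f"({body})")
        return " ∨ ".join(disjuncts)

    # q(X) <- r(X, Y)  and  q(X) <- s(X, Y): the two occurrences of Y are
    # renamed apart, as required by Definition 2.2.18.
    print(defining_formula({"X"}, ["r(X, Y)", "s(X, Y)"]))
    # (r(X, Y_0)) ∨ (s(X, Y_1))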

Using the terminology of relational databases, facts correspond to tuples and intensional predicates that are not queries are called views; views can occur anywhere in the body of IDB rules, whereas query predicates can only be used as such (i.e., not in rule bodies). To complete the mapping with the relational world, a position in a predicate corresponds to a relational attribute.

Definition 2.2.19 (Extension) The extension of a database predicate p in a database D is defined as the set of ground tuples {~a | D |= p(~a)}; if ⇐ p(X~) is a query, we also refer to p's extension as the answer to ⇐ p(X~) in D and denote it A^p_D.

Definition 2.2.20 (Update) A predicate update for an extensional predicate p in a database D is an expression of the form p(X~) ⇐ p′(X~) where ⇐ p′(X~) is a query for D; p is said to be affected by the update. A (database) update is a set of predicate updates for distinct predicates. For a given database D = ⟨IDB, EDB⟩ and an update U, the updated database D^U is defined as ⟨IDB, EDB′⟩, where EDB′ is as EDB but in which, for every extensional predicate p affected by a predicate update p(X~) ⇐ p′(X~) in U, the subset {p(~t) | D |= p(~t)} of EDB is replaced by the set {p(~t) | D |= p′(~t)}.

As mentioned, integrity constraints need to be specialized for update patterns rather than for specific updates. This is achieved by using parameters. In general, a parametric formula does not have a truth value but its meaning may be understood as a mapping from sequences of constants to truth values. Parameters are not part of any query or update actually given to a database, but, again, we may have parametric expressions of these categories.

Definition 2.2.21 (Parametric instance, equivalence) A parameter substitution is a mapping from parameters to constants. Whenever E is an expression containing parameters, and π is a parameter substitution for all the parameters in E, the notation Eπ denotes the parameter-free expression that arises from E when each occurrence of a parameter is replaced by its value specified by π; Eπ is called a parametric instance of E. We use the notation {~a/~c} for a parameter substitution in order to indicate that parameters ~a are orderly mapped to constants ~c.

The notation A |= B is extended to parametric expressions with the meaning that Aπ |= Bπ holds for all parameter substitutions π such that Aπ and Bπ are parameter-free; similarly for A ≡ B. The notation D1 |= φ ⇔ D2 |= ψ, for databases D1 and D2 and constraint theories φ and ψ, indicates that, for any parametric instance (D1′, φ′, D2′, ψ′) of (D1, φ, D2, ψ), D1′ |= φ′ if and only if D2′ |= ψ′.

Parametric updates will be used in the following chapters as input to the transformations that compose the simplification framework; however, there is no sense in directly applying a parametric update to a database.

Example 2.2.22 Consider the intensional database

IDB1 = {p′(X) ← p(X), p′(X) ← X = a}.

The following update U1 describes the addition of fact p(a) (p is an extensional predicate):

{p(X) ⇐ p′(X)}.

As mentioned, for convenience, defining formulas instead of queries will be written in the body of predicate updates wherever possible. Update U1 will, e.g., be indicated as follows:

{p(X) ⇐ p(X) ∨ X = a}.

The following U2 means “change any r(a, X) into r(b, X)”, where r is an extensional predicate.

{r(X,Y) ⇐ (r(X,Y) ∧ X ≠ a) ∨ (r(a, Y) ∧ X = b)}.

The following U3 is a parametric update that looks very much like U2, but refers to two parameters instead of constants.

{r(X,Y) ⇐ (r(X,Y) ∧ X ≠ a) ∨ (r(a,Y) ∧ X = b)}.

Notice that if a and b are instantiated to the same constant, U3 does not perform any changes. The following U4 exchanges the contents of extensional predicates p and q.

{p(X) ⇐ q(X), q(X) ⇐ p(X)}.
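
To make the evaluation of predicate updates concrete, the following Prolog sketch (the helper predicates p_new/1, q_new/1 and apply_u4/0 are invented for this illustration and are not part of the formal framework) computes the new extensions for U4 by evaluating both defining queries against the old state before any fact is replaced; this is what makes the exchange of p and q simultaneous.

    :- dynamic p/1, q/1.

    p(1). p(2).          % old extension of p
    q(3).                % old extension of q

    % Defining queries of U4 = {p(X) <= q(X), q(X) <= p(X)},
    % read against the old state.
    p_new(X) :- q(X).
    q_new(X) :- p(X).

    % Collect both new extensions first, then replace the old facts.
    apply_u4 :-
        findall(X, p_new(X), Ps),
        findall(X, q_new(X), Qs),
        retractall(p(_)), retractall(q(_)),
        forall(member(X, Ps), assertz(p(X))),
        forall(member(X, Qs), assertz(q(X))).

After calling apply_u4, the extension of p is {3} and that of q is {1, 2}, as prescribed by definition 2.2.20.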

2.2.3 Query containment

Once a database is available, a crucial issue is to answer queries efficiently. In this respect, optimization techniques arise naturally from a simple decision problem: given two queries ⇐ p(X~) and ⇐ q(X~), is A^p_D ⊆ A^q_D for all databases D? This problem, known as query containment (QC), has been studied extensively in the literature.

Definition 2.2.23 (Query containment problem) Let S = hIDB, ICi be a database schema and ⇐ p(X~), ⇐ q(X~) two queries with identical arity. The QC problem for p and q, denoted S : p ⊆ q, is the problem of deciding whether A^p_D ⊆ A^q_D for all consistent databases D with schema S. QC over a language L is the problem of deciding, for all schemata S ∈ L and queries ⇐ p(X~), ⇐ q(X~) in S, whether S : p ⊆ q.

Example 2.2.24 Suppose that p and q are defined in the IDB of a schema S and that it is known that S : p(X~ ) ⊆ q(X~ ), i.e., that query ⇐ p(X~ ) is contained in query ⇐ q(X~ ). Consider a query Q defined as ⇐ p(X~ ) ∧ q(X~ ) posed on a database with schema S. Then, for all databases with schema S, Q is equivalent to the query ⇐ p(X~ ), in the sense that they will always produce identical answers. However, the latter contains fewer predicate symbols and fewer variable bindings than Q, which correspond, respectively, to relations and joins in the relational setting.

Unfortunately, QC is undecidable in the general case. Shmueli [158], relating QC to the containment problem for context-free languages, proved that QC is already undecidable for datalog databases without negation. This means that QC is undecidable also for larger database classes, such as that of stratified databases. Since integrity constraints can be regarded as queries whose answer must always be empty, this undecidability will have some repercussions on the realization of a “perfectly” efficient integrity checker, i.e., a tool that is able to eliminate all possible redundancies for query evaluation in all cases. We will delve into these issues in chapter 3.

2.3 Integrity constraints

Although integrity constraints are a common as well as central notion in the context of databases and information systems, there is no general consensus on the meaning of “integrity”. Certainly integrity constraints contribute to the characterization of the semantics of the data, by imposing dependencies and restrictions that must always be

complied with. The very notion of data integrity is usually regarded as correctness and accuracy of the data in the database. For example, Motro [133], with his equation “Integrity = Validity + Completeness”, looks at integrity from the point of view of queries, stating metaphorically that their “answers have integrity if they contain the whole truth (completeness) and nothing but the truth (validity)”.

The lack of consensus is apparent once we examine the notion of constraint satisfaction. In [88], two different views are presented: satisfaction of integrity constraints by entailment (IC must hold in all models of the database) and by consistency (IC must hold in at least one model of the database, i.e., there exists a model of EDB ∪ IC ∪ IDB, where hIDB, EDBi is the database in question). These approaches coincide for positive databases without inequality, but may differ if negation is present. Furthermore, in the presence of negation, there may be several minimal models of the database. Two extreme approaches in this case are the pure entailment approach (IC must hold in all minimal models of the database) and the pure consistency approach (IC holds in at least one minimal model of the database). For stratified databases, satisfaction of IC in the standard model is the usual way of reconciling these approaches.

Definition 2.3.1 (Consistency) A database D is consistent⁶ with a constraint theory IC whenever D |= IC.

We chose to represent integrity constraints as denials, rather than as arbitrary clauses, following the intuition that they are not used to infer knowledge, but rather to discard models that are considered unacceptable.

2.3.1 Static vs. dynamic constraints

The integrity constraints considered in this thesis are the so-called static integrity constraints, which indicate properties that must be met by the data in each database state. A disjoint class is that of the so-called dynamic constraints, which are used to impose restrictions on the way the database states can evolve over time. A typical example of a dynamic constraint is “the marital status of a person cannot change from single to divorced”. This condition is not static in that it does not refer to a single state, but rather to an old and a new state at the same time. Dynamic constraints, like this one, that refer to the change from a state to the next one are called transitional constraints. More complicated conditions involving (possibly infinite) sets of states may be expressed by resorting to temporal logics that have constructs such as “since”, “until”, “before”, etc. Dynamic constraints have been considered, e.g., in [53, 65].

2.3.2 Hard vs. soft constraints

Another important distinction can be made between hard (or strong) constraints and soft (or deontic) constraints. The former are used to model necessary requirements of the world that the database represents. For example, the fact that “every person must

⁶Note that we use here the terms “consistency” and “integrity” as synonyms, whereas for other authors [30] the word “consistency” refers to the satisfiability of a set of integrity constraints independently of any state, i.e., whether there exists any database state in which the integrity constraints are satisfied.

be either male or female” can be reasonably considered an intrinsic truth of the world represented, e.g., by a database containing information on the personnel of a company. Deontic constraints govern what is obligatory but not necessary of the world. For instance, the fact that “every ordered item is in the stock” is certainly a requirement for the good running of an organization but may not correspond to a necessary truth of the world; however, if the company adheres to a policy that guarantees that ordered items will always be available, then this might be considered a hard constraint.

In other words, violations of deontic constraints correspond to violations of obligations that have to be reported to the user or administrator of the database, rather than inconsistencies proper. The wording “soft constraint” sometimes refers to a condition that should preferably be satisfied, but may also be violated, while with “deontic constraint” one usually emphasizes the distinction between what the database believes is true of the world it represents and what is actually true of the world. In this thesis we concentrate on hard constraints; deontic constraints have been considered, e.g., in [40] and soft constraints in [89]. Applications of so-called soft constraints seem to be possible, albeit to a lesser extent, for integrity checking and semantic query optimization (see [89] and references therein).

2.3.3 Classification of integrity constraints

Static, hard constraints can be further divided into several subclasses. In the relational setting, it is common to classify constraints according to the entities that they involve. At the most specific level are tuple constraints, which impose restrictions on the combinations of elements in a tuple. Among these, domain constraints indicate the possible values that an attribute of a tuple can take.

Constraints that refer to the tuples of a single relation are referred to as intra-relation constraints and include different kinds of data dependencies that constrain equalities or inequalities between values of attributes of a relation. Among these, functional dependencies [61] and multi-valued dependencies [78] are the most common ones. A functional dependency (FD) involves sets of attributes, say X and Y, of a relation R and is written X → Y to indicate that every pair of tuples in R that agrees on the attributes in X also agrees on the attributes in Y. While a functional dependency X → Y relates one value of X to one value of Y, a multi-valued dependency (MVD), indicated X ↠ Y, defines a relationship in which a set of values of attribute Y is determined by a single value of X. The multi-valued dependency X ↠ Y is said to hold for a relation R with (sets of) attributes X, Y, Z, written R(X,Y,Z), if, for a given set of values for attributes X, there is a set of associated values for the set of attributes Y, and the Y values depend only on the X values and have no dependence on the set of attributes Z. This also indicates that relation R(X,Y,Z) is the join of its projections R1(X,Y) and R2(X,Z), i.e., it can be decomposed into R1 and R2 without loss of information. Key dependencies are functional dependencies that impose a key, i.e., a subset of attributes of a relation that uniquely identifies a tuple in that relation.

Constraints involving two or more relations are known as inter-relation constraints. Among these are inclusion dependencies (also known as referential constraints), which involve two relations and constrain the set of values assumed by some attributes on the first relation to be a subset of the values assumed by the same attributes on the second

relation. In particular, when these attributes are a key for the second relation, the constraint is called a foreign key constraint.

All the constraints mentioned so far can be expressed in the setting of deductive databases as a finite set of denials (see definition 2.1.2), apart from domain constraints involving infinite domains. These last ones would be easily captured by a typed version of datalog; however, such constraints are often considered as structural rather than semantic constraints, and thus have little interest for this work.

A more complex, and yet very common, kind of constraint is that of aggregate constraints, which involve several (possibly all) tuples of a relation simultaneously. For example, constraints on the number of occurrences of certain tuples in a relation or the average of the values assumed by certain attributes belong to this category. However, aggregates are not part of standard datalog; we will introduce an extension of the language that takes these constructs into account in section 4.2. More detailed classifications of integrity constraints for deductive databases have been attempted by several authors, e.g., [163, 97].
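
As an illustration of how such dependencies appear in the denial form used in this thesis, the following Prolog sketch (the relations emp/2 and dept/1 and the predicates violates_key/0 and violates_fk/0 are invented for this example) phrases a key dependency and an inclusion dependency as violation queries: each constraint is satisfied exactly when the corresponding query fails.

    :- dynamic emp/2, dept/1.

    % Key dependency on emp's first attribute, i.e., the denial
    %   ← emp(X,Y) ∧ emp(X,Z) ∧ Y ≠ Z.
    violates_key :- emp(X, Y), emp(X, Z), Y \= Z.

    % Inclusion (referential) dependency from emp's second attribute
    % into dept, i.e., the denial  ← emp(X,Y) ∧ ¬dept(Y).
    violates_fk :- emp(_, Y), \+ dept(Y).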

2.3.4 Constraint specification languages

Different languages for expressing integrity constraints exist. Common paradigms are, e.g., relational algebra and relational calculus; these (or proper extensions thereof) are known to have exactly the same expressive power as datalog¬ [3] (in particular, the algebra plus while-statements, the calculus plus fixpoint and datalog with negation are all equivalent). Our choice to express integrity constraints as datalog denials rather than relational algebraic expressions is motivated by the availability of proof tools for first-order logic that allow one to reason easily about formulas. The preference granted to datalog with respect to relational calculus is, however, only made for uniformity with recent works in the field (such as [70, 20]) and for the closer similarity with Prolog, which will prove a useful test platform for some of our experiments (see chapter 6).

The techniques presented in this work can also be adapted to the context of SQL. A mapping between datalog and SQL, although technically nontrivial, is of little theoretical interest and will not be discussed in detail in this thesis⁷. A technique for translating datalog into (subsets of) SQL has been described in [70].

SQL provides a series of built-in constructs for specifying integrity constraints, such as PRIMARY KEY, FOREIGN KEY and UNIQUE. These have obvious datalog counterparts and can be checked and maintained very efficiently by any relational database management system. SQL also allows one to express more general constraints, such as conditions referring to exactly one or to several attributes of a relation, called respectively column check constraints and table check constraints; both are specified with the CHECK keyword. Since 1992, the SQL standard has been equipped with the ASSERTION construct, which allows the specification of general inter-relation constraints. These may contain any condition that can be expressed with an SQL query and are therefore very powerful. However, DBMSs are typically not able to verify CHECK and ASSERTION constraints in an incremental way.

⁷There may be, however, both syntactic and semantic discrepancies between the two paradigms. We note that SQL admits null values, which do not have a counterpart in the model-theoretical view of databases. Furthermore, SQL is based on a multi-set semantics instead of a set semantics, which we will consider in section 4.2.

For this reason, assertions are not commonly supported by today's vendors, and, even when they are [140], their use is not encouraged.

The main purpose of this thesis is to provide techniques for efficiently and incrementally checking general integrity constraints. Such techniques are aimed at providing a realistic framework for extending the capabilities of relational database systems. It should be noted at this point that, in some cases, the presence of complex general constraints may be regarded as a symptom of a badly defined schema, and so schema redesign proves to be an effective practical means of reducing their complexity. However, there are several interesting application scenarios for which the specification of complex constraints is natural or even necessary; examples will be given throughout this thesis.

2.4 Integrity control

Integrity control addresses the problem of possible integrity violations upon modifications of the database state determined by updates. We now formalize the problem and specify when the control takes place and what can be done to maintain the database in a consistent state.

2.4.1 Immediate vs. deferred semantics

An update transaction (in short: update) can be imagined as a sequence of operations that modify the state of a database. In this regard, satisfaction of integrity constraints with respect to the update can be considered under two different perspectives. The immediate semantics requires satisfaction after each single modification of the state; the opposite approach is that of the deferred semantics, which requires satisfaction after the whole update has been executed, but not necessarily in the intermediate states. In this thesis we refer to the deferred semantics only since, from the point of view of integrity checking alone, the immediate semantics is the special case of the deferred semantics in which the update transaction consists of a single operation.

2.4.2 Constraint verification

The constraint verification problem may be formulated as follows. Given a database state D, a constraint theory Γ such that D |= Γ, and an update U, does DU |= Γ hold too? However, checking directly whether DU |= Γ holds may be too expensive, so a suitable reformulation of the problem is called for. A first improvement is achieved if another constraint theory ∆, informally referred to as a post-test, can be found such that DU |= Γ iff DU |= ∆ and ∆ is somehow “easier” to evaluate than Γ. A characterization of the efficiency of evaluation of integrity constraints with respect to given cost models is provided in chapter 3.

The approach we pursue here is slightly different, in that we prefer to check integrity in the current state D, i.e., before the execution of a potentially offensive update U. With our approach we look for a constraint theory ΓU such that DU |= Γ iff D |= ΓU and ΓU is easier to evaluate than Γ. In other words, the condition ΓU, called a pre-test of the original constraints Γ, should specialize the original Γ, as specific information coming from U is available, and avoid redundant checks by exploiting the fact that D |= Γ holds. Sometimes, post-tests and pre-tests are referred to as simplifications of the original constraints. We observe that reasoning about the future database state DU with a condition (ΓU) that is tested in the present state D complies with the deferred semantics and allows us to avoid the execution of illegal updates altogether.
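
As a minimal illustration of a pre-test (anticipating example 3.1.3), consider the constraint ← p(X) ∧ q(X) and an update that adds p(a): since the database is assumed consistent before the update, it suffices to check ← q(a) in the present state. The Prolog sketch below (the predicates p/1, q/1 and the helper insert_p/1 are invented for this illustration) follows this pattern; it only conveys the idea of evaluating ΓU before executing U and is not part of the formal framework.

    :- dynamic p/1, q/1.

    q(b).                 % sample state, consistent with ← p(X) ∧ q(X)

    % Pre-test for the update adding p(A): the updated state satisfies
    % ← p(X) ∧ q(X) iff q(A) does not hold in the present state.
    insert_p(A) :-
        ( q(A) ->
            writeln('update rejected: pre-test violated')
        ;   assertz(p(A))
        ).

Here insert_p(b) rejects the update (q(b) holds), whereas insert_p(a) performs it; the check touches only the single fact q(A) instead of re-evaluating the whole constraint.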

2.4.3 Checking, maintenance, and paraconsistency

Once illegal updates are detected, it must be decided how to restore a consistent database state. The approach described in the previous subsection, which we adopt in this thesis, is based on a complete prevention of inconsistencies: integrity of the updated state is checked in the state preceding the update and, whenever an illegal update is detected, it is not executed.

As mentioned, integrity checking can also be done in the state following the update. In this case, upon inconsistencies, integrity needs to be restored via corrective actions. Most typically, such an action is a rollback, which simply cancels the effects of the unwanted update and usually requires costly bookkeeping in order to restore the old state. In other circumstances, the update can be considered as an operation that is not completely illegal, but that can only be accepted provided that other parts of the database be modified. In this case a repairing action is needed to change, add or delete tuples of the database in order to satisfy the integrity constraints again. The obtained database is called a repair. Whereas approaches based on prevention or rollback can be completely automated, repairs typically require the intervention of the database designer to indicate how to react upon detection of inconsistencies. For very simple (yet common) cases, SQL itself provides some standard reaction policies such as cascade and no action for foreign keys. For more complex circumstances, the DBMS might need to calculate a corrective action according to some criterion of minimality, perhaps corresponding to preferences specified by the user. The generation of repairs is a nontrivial issue; see, e.g., [10, 36, 11, 96].

Active rules provide a flexible tool that encompasses all three approaches. According to the event-condition-action paradigm, triggers react upon (either before or after) a given event (e.g., the update); then they check a given condition specific for that event, such as a (possibly simplified) integrity constraint; finally, if the condition is satisfied, they execute some action, such as aborting the transaction or repairing the database. The main drawback of current trigger-based solutions is that triggers are generated by hand and are typically very difficult to maintain. Approaches based on automatic or semi-automatic generation of triggers for constraint maintenance have appeared since the 1990s [46].

In some scenarios, a temporary violation of integrity constraints may be accepted provided that consistency is quickly repaired; if, by the nature of the database application, the data are particularly unreliable, inconsistencies may even be considered unavoidable. Approaches that cope with the presence of inconsistencies are commonly referred to as paraconsistent (see, e.g., [72]). In this direction, a problem that has been studied in the literature is that of allowing inconsistencies to occur in databases but filtering the data during query processing so as to provide a consistent query answer [10], i.e., the set of tuples that answer a query in all possible repairs (of course without actually having to compute all the repairs).

Chapter 3

Simplification of Integrity Constraints

This chapter formally specifies the simplification problem and defines the components of a procedure that produces a simplification of integrity constraints with respect to a given update pattern. The key ideas are based on the After and Optimize operations. The former allows the construction of a schema that represents the given integrity constraints in the state after the update, but applying to the current state; in this way, integrity can be checked before performing the update. The latter aims to eliminate redundancies from After's output so as to speed up integrity checking.

In section 3.1 we formalize the simplification problem and we also characterize the notion of ideal simplification, i.e., the ability of a simplification procedure to produce results that are optimal according to some criterion. Then, in section 3.2, we describe a procedure by means of transformations that can be used to achieve simplification in a non-recursive database language called LS. Before assessing the quality of the results obtained with the described transformations, we first explore the theoretical limits of simplification procedures in general. In particular, in section 3.3, we show the relationship between query containment and ideal simplification procedures. In section 3.4, we discuss different criteria of “ideality” of simplification and describe their relationship with efficiency of evaluation of integrity constraints. In section 3.5, we identify the cases in which the simplification procedure described for LS behaves ideally and add some considerations about the complexity. Finally, we conclude the chapter by discussing related work.

3.1 The simplification problem

As discussed in the previous chapter, our goal is to produce a series of checks that can be evaluated in the present database state and tell whether the database will remain consistent after a prospective update has been performed. This is achieved in two steps. First we produce a new constraint theory that holds in the present state if and only if the original integrity constraints hold following the update. Then we attempt to remove redundancies from the new theory, possibly by exploiting the hypothesis that the

database was consistent before the update; the obtained theory, as mentioned, is called a simplification. In this section we discuss the conditions that must be satisfied in order to achieve a simplification of a constraint theory with respect to a given update pattern; later, in section 3.2, we present a concrete simplification framework that operates in terms of syntactic transformations of integrity constraints.

3.1.1 Weakest preconditions

As already emphasized, it is important to be able to test that a prospective database update does not violate the integrity constraints — without actually executing the update, i.e., a test is needed that can be checked in the present state but that indicates properties of the prospective new state. A semantic correctness criterion for such a test is given by the notion of weakest precondition.

Definition 3.1.1 (Weakest precondition) Let S = hIDB, Γi be a schema and U an update. A schema S′ = hIDB′, Γ′i compatible with S is a weakest precondition (WP) of S with respect to U whenever D |= Γ′ ⇔ DU |= Γ for any database D based on IDB ∪ IDB′.

As also noticed by Qian [143], this definition of WP is similar to the standard axiom for defining assignment statements in a programming language, whose side effects are analogous to a database update. Hoare's [101] original version of the axiom used only implication from pre- to postcondition; the notion of weakest precondition that we need is due to Dijkstra [75]. This definition is given here in terms of schemata instead of constraint theories, since new intensional predicates might be needed in order to properly characterize a WP; conversely, in other cases, some predicates originally defined in IDB might be irrelevant for that purpose. The notion of schema compatibility ensures that no predicate name clashes occur between the original constraint theory and that of a WP. In the remainder of the thesis, whenever the IDB is clear from the context, we shall omit it when indicating a schema or database; in these cases a WP may simply be written as a constraint theory. It is also important to notice that parametric updates may result in parametric WPs.

The notion of weakest precondition is a sufficient but coarse semantic correctness condition for integrity checks to be performed before an update. In order to also employ the knowledge that the database is consistent before each update, we extend the class of checks of interest as follows.

Definition 3.1.2 (Conditional weakest precondition) Let S = hIDB, Γi be a schema and U an update. A schema S′ = hIDB′, Γ′i compatible with S is a conditional weakest precondition (CWP) of S with respect to U whenever D |= Γ′ ⇔ DU |= Γ for any database D based on IDB ∪ IDB′ and consistent with Γ.

A WP is also a CWP but the reverse does not necessarily hold. In fact, we expect that in many cases the optimal condition is among those CWPs that are not WPs.

Example 3.1.3 Consider U1 and IDB1 from example 2.2.22 and let Γ1 be the constraint theory {← p(X) ∧ q(X)} stating that p and q are mutually exclusive. Then {← q(a)} is a CWP (but not a WP) of Γ1 with respect to U1.

The following simple but important property indicates how parameters interact with these notions and follows directly from definition 3.1.2.

Proposition 3.1.4 Let S and S′ be schemata, U a parametric update, and π a parameter substitution for the parameters in U. If S′ is a CWP of S with respect to U then S′π is a CWP of S with respect to Uπ.

This indicates that it is meaningful to consider general procedures searching for CWPs that are used, so to speak, “statically” on parametric updates corresponding to update patterns allowed by the database designer, before the database contains any data.

3.1.2 Definition of simplification and ideal simplification

A CWP is a necessary and sufficient condition for consistency of a database in the updated state to be checked in the state prior to the update. In order to qualify as a simplification, a CWP should be considered preferable to any non-optimized WP. For this purpose, we assume, for any schema S and update U, the existence of a fixed reference schema S¯U representing a non-optimized WP of S with respect to U; we will show the construction of such a WP in section 3.2. We further assume an ordering to sort the different CWPs so that the smallest element in this ordering represents an optimum. For such orderings, we state here a set of natural requirements that are used in our proofs; in section 3.4, we will review several different orderings.

Definition 3.1.5 An ordering between schemata is a reflexive and transitive binary relation ⪯. The ordering is enumerative if, for any two schemata S and S′:

1. h∅, ∅i ⪯ S.

2. It is decidable whether S ⪯ S′.

3. S ≠ S′ iff either S ≺ S′ or S′ ≺ S (S ≺ S′ means S ⪯ S′ but not S′ ⪯ S).

4. For a given S, the set {S′′ | S′′ ⪯ S} is finite and the schemata in it can be enumerated.

In order for conditions 3 and 4 to be satisfied in interesting cases, it is essential to consider as identical any two expressions that differ only by a consistent renaming of bound variables or only in the order of arguments of commutative and associative operators. Unless stated otherwise, we always refer to enumerative orderings; see section 3.4 for further discussion.

Definition 3.1.6 (Simplification) A function from a schema S and an update U to another schema S′ is called a simplification function for an ordering ⪯, written SimpU(S) = S′, whenever

• S′ is a CWP of S with respect to U (and thus S′ is compatible with S),

• S′ ⪯ S¯U.

Schema S′ is called a simplification of S wrt U. We say that a simplification is ideal whenever there is no other schema S′′ that is a CWP of S with respect to U such that S′′ ≺ S′. A simplification function is ideal if, for any S, U, SimpU(S) is an ideal simplification of S wrt U. A(n ideal) simplification procedure is an algorithm that implements a(n ideal) simplification function.

According to this definition, any CWP that is at least as small as S¯U in the ⪯ ordering is considered a simplification; the minimal one, among those, is the ideal simplification. This distinction makes sense because, as we shall see, it is not possible to obtain an ideal simplification in all cases, but even a non-ideal simplification can be a significant improvement with respect to a non-optimized WP. This is particularly important when the ordering somehow reflects the effort of checking the satisfaction of the constraint theory in any database state: ideal simplifications then express the best possible way of checking integrity constraints.

An immediate consequence of definitions 3.1.5 and 3.1.6 is that, if the update cannot affect integrity, the ideally simplified constraint theory must be ∅.

Proposition 3.1.7 Let U be an update and S = hIDB, ICi a schema such that, for any database D based on IDB and consistent with IC, DU is also consistent with IC. Let Simp be an ideal simplification procedure. Then SimpU(S) = h∅, ∅i.

Proof Trivial, as the schema h∅, ∅i is a CWP of S with respect to U and is minimal with respect to ⪯ (first assumption on the ordering in definition 3.1.5). □

As mentioned, simplification can be performed statically, but further simplification can be achieved, in some cases, once the parameters are instantiated, even when the initial simplification was ideal.

Example 3.1.8 Consider the following constraint theory and update:

Γ = {← p(X,Y) ∧ X ≠ Y}

U = {p(X,Y) ⇐ p(X,Y) ∨ (X = a ∧ Y = b)}.

The schema S′ = h∅, {← a ≠ b}i is a CWP of h∅, Γi wrt. U; S′ is an ideal simplification with respect to an enumerative ordering counting the number of literals in a constraint theory, as S′ cannot be reduced to an expression with fewer literals. However, for π = {a/c, b/c} we get S′π ≡ h∅, ∅i ≺ S′π (by assumptions 1 and 3 on ≺ in definition 3.1.5). We introduce later, in proposition 3.1.13, a simple correction for the indicated problem.

When the IDB is clear from the context, we may omit it and write more simply that SimpU(Γ) = Γ′, where Γ and Γ′ are constraint theories. If U is parametric, the resulting Γ′ may be parametric. Note, however, that Γ′ is not necessarily parametric even though U is. This happens, for instance, when U does not affect Γ or when (in)equalities between parameters become eliminable, as in example 3.1.9.

Example 3.1.9 Consider the constraint theory Γ = {← p(X,Y) ∧ X ≠ Y} and the parametric update U = {p(X,Y) ⇐ p(X,Y) ∨ (X = a ∧ Y = a)}. Then Γ′ = ∅ is a (non-parametric) CWP of Γ with respect to U.

3.1.3 Simplification based on optimization

The basic idea in our approach to produce simplified integrity constraints is to start with the reference schema and optimize it as much as possible.

Definition 3.1.10 (Conditional equivalence) Let S∆ = hIDB∆, ∆i, S = hIDB, Γi, S′ = hIDB′, Γ′i be compatible schemata. Then S and S′ are conditionally equivalent with respect to S∆, denoted S ≡S∆ S′, if D |= Γ ⇔ D |= Γ′ for any database D based on IDB∆ ∪ IDB ∪ IDB′ and consistent with ∆.

Conditional equivalence characterizes the class of constraint theories/schemata that are equivalent under the hypothesis that another constraint theory/schema holds. In definition 3.1.10, S∆ typically includes the constraints to be simplified and may also contain other hypotheses that are trusted. Now we can precisely specify the notion of optimization with respect to given hypotheses.

Definition 3.1.11 An optimization function takes two schemata S∆ and S and produces another schema S′, written OptimizeS∆(S) = S′, such that

• S ≡S∆ S′,

• S′ ⪯ S,

• OptimizeS∆(S′) = S′, i.e., the function is idempotent.

An optimization function is ideal whenever there is no other schema S′′ ≡S∆ S such that S′′ ≺ S′. An (ideal) optimization procedure is an algorithm that implements an (ideal) optimization function. Proposition 3.1.12 follows immediately from the definitions.

Proposition 3.1.12 For a given optimization function (procedure) Optimize, the function (procedure) defined by S ↦ OptimizeS(S¯U) is a simplification function (procedure). If Optimize is an ideal function (procedure), so is the indicated simplification function (procedure).

We end this section by showing how an extra round of optimization can repair the slight lack of perfection in the application of (even ideal) simplification to parametric updates indicated in example 3.1.8.

Proposition 3.1.13 Let S = hIDB, Γi be a schema, U a parametric update, and π a parameter substitution for the parameters in U. Assume an ideal simplification procedure Simp and an ideal optimization procedure Optimize. Then OptimizeS((SimpU(S))π) is a minimal CWP of S with respect to Uπ.

Proof Let S′ = SimpU(S). By proposition 3.1.4 we have that S′π is a CWP of S with respect to Uπ. By definition 3.1.11, OptimizeS(S′π) is minimal according to ≺. □

Figure 3.1: The starred dependency graphs for the schemata of example 3.2.2.

3.2 Transformations of integrity constraints

We start this section by formally characterizing a language, which we call LS, on which a core simplification procedure can be applied. The language LS is described in section 3.2.1. Later, in section 3.2.2, we show a procedure, called After, that, for any given schema S and update U, effectively generates the reference weakest precondition S¯U that was only assumed to exist in the previous section. This procedure is then refined for schemata and updates in LS. In section 3.2.3, we give a concrete implementation for the operators Optimize and Simp for the language LS. Their usage is then demonstrated in several examples. Finally, in section 3.2.4 we further discuss differences between pre-tests and post-tests.

3.2.1 The language LS

We now introduce a few technical notions that are needed in order to precisely identify the class of predicates, integrity constraints and updates that are part of LS, which is a non-recursive function-free language equipped with negation.

Definition 3.2.1 (Starred dependency graph) Let S = hIDB, Γi be a schema in which the IDB consists of a set of disjunctive (range restricted) predicate definitions and Γ is a set of range restricted denials. Let G be a graph that has a node N^p for each predicate p in S plus another node named ⊥, and no other node; if p's defining formula has non-distinguished variables, then N^p is marked with a “*”. For any two predicates p and p′ in S, G has an arc from N^p to N^p′ for each occurrence of p in an IDB rule in which p′ occurs in the head; similarly, there is an arc from N^p to ⊥ for each occurrence of p in the body of a denial in Γ. In both cases the arc is labelled with a “−” (and said to be negative) iff p occurs negatively. G is the starred dependency graph for S.

Example 3.2.2 Let S1 = hIDB1, Γ1i and S2 = hIDB2, Γ2i be schemata with

IDB1 = { s1(X) ← r1(X,Y) },
Γ1 = { ← p1(X) ∧ ¬s1(X) },
IDB2 = { s2(X) ← r2(X,Y), q2(X) ← ¬s2(X) ∧ t2(X) },
Γ2 = { ← p2(X) ∧ ¬q2(X) }.

The starred dependency graphs G1 and G2 of, respectively, S1 and S2 are shown in figure 3.1.

Definition 3.2.3 (LS) Let S = hIDB, Γi be a database schema in which the IDB consists of a set of disjunctive predicate definitions and Γ is a set of range restricted denials. Then

S is in LS if:

1. its starred dependency graph G is acyclic;

2. in every path in G from a starred node to ⊥ the number of arcs labelled with “−” is even.

Consider an update U for S of the form {p1(X~1) ⇐ q1(X~1), ..., pm(X~m) ⇐ qm(X~m)}, where the qi's are intensional predicates in S. Then U is in LS with respect to S if the graph obtained from G by adding an arc from N^qi to N^pi, for 1 ≤ i ≤ m, still satisfies condition 2 above.

The acyclicity of the starred dependency graph corresponds to the absence of recursion in the database. The second condition in definition 3.2.3 requires that the unfolding of the intensional predicates in the constraint theory with respect to their definitions does not introduce any negated existentially quantified variable, as explained below. A similar requirement is imposed when the database is updated. In particular, starred nodes correspond to literals with non-distinguished variables, which are existentially quantified; therefore, in order to avoid negated existential quantifiers, which cannot be represented in LS denials, the number of encountered “−” signs must be even, as an even number of negations means no negation.

Example 3.2.4 Consider S1 and S2 from example 3.2.2. Clearly S1 is not in LS, as in its starred dependency graph G1 in figure 3.1 there is a path from the starred node s1 to ⊥ containing one “−” arc. On the contrary, S2 is in LS, as in G2 the only path from the starred node s2 to ⊥ contains two “−” arcs.

In semi-positive schemata, negation can only occur before extensional predicates. Therefore, in the dependency graph of a semi-positive schema, no path from a starred node to ⊥ can contain negative arcs. This indicates that the language of non-recursive semi-positive schemata is contained in LS. In fact, the containment is strict, since, e.g., schema S2 is in LS, but is not semi-positive. According to these considerations, the class of schemata captured by LS is already sufficiently broad as to include many practically relevant cases. The extension to more expressive languages will be considered in chapter 4.
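
As a small executable illustration of condition 2 in definition 3.2.3, the following Prolog sketch (the predicates arc/3, starred/1 and condition2/0 are invented for this purpose) encodes the starred dependency graph G2 of example 3.2.2 and checks that every path from a starred node to ⊥ crosses an even number of negative arcs; the path enumeration terminates because the graph is acyclic (condition 1).

    % Starred dependency graph G2 of example 3.2.2: an arc runs from a body
    % predicate to the head predicate of a rule, or to bottom for occurrences
    % in a denial; neg marks negated occurrences.
    arc(r2, s2, pos).
    arc(s2, q2, neg).
    arc(t2, q2, pos).
    arc(p2, bottom, pos).
    arc(q2, bottom, neg).
    starred(s2).     % s2's defining formula has a non-distinguished variable

    % negs(From, To, N): some path from From to To crosses N negative arcs.
    negs(X, X, 0).
    negs(X, Z, N) :-
        arc(X, Y, Sign),
        negs(Y, Z, N0),
        ( Sign == neg -> N is N0 + 1 ; N = N0 ).

    % Condition 2 of definition 3.2.3: every such count is even.
    condition2 :-
        forall(( starred(S), negs(S, bottom, N) ), N mod 2 =:= 0).

With the graph G1 of S1 instead (where the only path from the starred node s1 to ⊥ crosses one negative arc), condition2 fails, matching example 3.2.4.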

3.2.2 Generating weakest preconditions

The following straightforward syntactic transformation After translates a constraint theory A referring to the updated database state into another constraint theory B that holds in the present state if and only if A holds following the update. In other words, After generates a WP.

Definition 3.2.5 (After) Let S = hIDB, Γi be a schema and U an update such that, for each predicate update p(X~ ) ⇐ pU (X~ ) in U, pU is defined in IDB.

• Let us indicate with ΓU a copy of Γ in which any atom p(~t) whose predicate is affected by a predicate update p(X~ ) ⇐ pU (X~ ) in U is simultaneously replaced by the expression pU (~t) and every intensional predicate q is simultaneously replaced by a new intensional predicate qU defined in IDBU below.

• Similarly, let us indicate with IDBU a copy of IDB in which the same replacements are simultaneously made, and let IDB∗ be the biggest subset of IDB ∪ IDBU including only definitions of predicates on which ΓU depends.

We define AfterU (S) = hIDB∗, ΓU i.

The IDB U used in the construction of definition 3.2.5 indicates auxiliary views that are needed in order to properly characterize the resulting constraint theory. As we shall see, often no such views are strictly necessary, whereas, in some other cases, they cannot be avoided; in the former case, when this is clear from the context, we may omit the specification of the intensional database from schemata. The definition given here is not limited to constraint theories in LS; we specialize it for LS in definition 3.2.13. We notice the following trivial property which is paramount for our use of parameters.

Proposition 3.2.6 Let S be a schema, U an update, and π a parameter substitution. Then (AfterU (S))π ≡ AfterUπ(S).

Proof The claim holds by construction: the parameters present in U are included in pU's defining formula for each predicate update p(X~) ⇐ pU(X~) in U. No pU definition is modified by After. The effect of applying π either to U or to AfterU(S) is therefore to replace the parameters in each pU definition. But the construction in (AfterU(S))π and in AfterUπ(S) is the same. The results are, thus, identical. □

Example 3.2.7 Consider the updates and IDB definitions of example 2.2.22 on page 17. Let schema S1 be hIDB1, Γ1i, where Γ1 = {← p(X) ∧ q(X)} states that p and q are mutually exclusive. We have AfterU1(S1) = hIDB1, Γ1^U1 i, where:

Γ1^U1 = {← p′(X) ∧ q(X)}.

Note that the occurrence of p′(X) in Γ1^U1 can be replaced by its defining formula without affecting the semantics of Γ1^U1. As mentioned, we may therefore omit the intensional database and more conveniently represent the final result of the After transformation as follows:

AfterU1(Γ1) = Σ = { ← p(X) ∧ q(X), ← X = a ∧ q(X) },

observing that AfterU1(S1) ≡ h∅, Σi. Here the disjunction in U1 was split into two clauses that only contain conjunctions, so as to conform to the denial format for integrity constraints. For Γ2 = {← r(b, X) ∧ q(X)} and update U2 we get, in a similar way, the following:

AfterU2(Γ2) = { ← r(b, X) ∧ b ≠ a ∧ q(X), ← r(a, X) ∧ b = b ∧ q(X) }.

As above, the operation resulted in a disjunction that was split into two denials. Note that b ≠ a and b = b are consequences of the free equality axioms and could be removed.

However, when applying instead the parametric update U3 to Γ2 we cannot remove equalities and non-equalities, as they evaluate to different truth values depending on different instantiations of the parameters. We get:

AfterU3(Γ2) = { ← r(b, X) ∧ b ≠ a ∧ q(X), ← r(a, X) ∧ b = b ∧ q(X) }.

If the parameters a, b are respectively instantiated to the two constants a and b, the result can be reduced to AfterU2(Γ2). In case both parameters are instantiated to the same constant a, the result collapses to Γ2, which is consistent with the observation that the update in this case is neutral.

The following example shows that After may introduce new predicates that cannot be eliminated from the resulting constraint theory by replacing them with their defining formula.

Example 3.2.8 Let D be a database containing information about the nodes of a directed graph. The extensional relation e/2 contains pairs of nodes that are connected by a directed arc. Directed paths between two nodes are expressed by the recursive intensional relation p/2 defined by the following rules:

IDBp = { p(X,Y ) ← e(X,Y ), p(X,Y ) ← e(X,Z) ∧ p(Z,Y ) }.

Acyclicity of the graph can be imposed with the following constraint theory

Γ = {← p(X, X)},

indicating that there must be no path from a node to itself. Let U = {e(X,Y) ⇐ e′(X,Y)} be an update pattern that adds the tuple ha, bi to relation e, where a and b are parameters and e′ is defined as follows.

IDBe′ = { e′(X,Y) ← e(X,Y), e′(X,Y) ← X = a ∧ Y = b }.

Let IDB = IDBp ∪ IDBe′ and S be the schema hIDB, Γi (which is not in LS). Then AfterU(S) = hIDBU, ΓUi, where

IDBU = { pU (X,Y ) ← e(X,Y ), pU (X,Y ) ← X = a ∧ Y = b, pU (X,Y ) ← e(X,Z) ∧ pU (Z,Y ), pU (X,Y ) ← X = a ∧ Z = b ∧ pU (Z,Y ) }, ΓU = { ← pU (X, X) }.

In this case, pU is recursively defined and therefore it is not possible to disregard IDBU by replacing occurrences of pU in ΓU with its defining formula. The treatment of such recursive cases will be described in chapter 4. The characteristic property of the After transformation is that of producing a WP.

Proposition 3.2.9 Let S be a schema and U an update. Then AfterU(S) is a WP of S with respect to U.

Proof Trivially AfterU(S) is compatible with S, as only new intensional predicates are introduced. Assume S = hIDB, Γi; assume for now that neither U nor S contains parameters. Let D be a database based on IDB ∪ IDBU. By construction, we have that

(1) for any database predicate p in Γ, its extension in DU coincides with the extension of pU in D.

Hence the claim DU |= Γ ⇔ D |= ΓU. To see that (1) is the case, we note that:

• If p is extensional, (1) coincides with the definition of update.

• If p is intensional, then pU in IDBU is defined exactly as p in IDB modulo renaming of predicates (adding superscript U). As claim (1) holds for the first stratum of the standard model (extensional predicates, by the previous point) and the rules are isomorphic, the argument can be repeated for the second stratum, and so on for all strata.

To see that this generalizes to the case where U contains parameters, observe that the arguments above hold for any instance of U and that AfterUπ(S) ≡ (AfterU(S))π for any parameter substitution π by proposition 3.2.6. □

Since a WP is also a CWP, the result produced by After is also trivially a CWP. However, the hypothesis of consistency of the database in the original state is not used at all in the calculation of After, so we may assume that the reference WP S¯U of a schema S with respect to update U is given by AfterU(S). Therefore, a simplification procedure must produce results that are at least as good (wrt. ⪯) as those produced by After. We note that, in some trivial cases, the result of the After transformation itself can be an ideal simplification, e.g., when the initial constraint theory is ∅.

The After operator simultaneously replaces all occurrences of an updated predicate with the corresponding query and possibly performs some equivalence-preserving transformations. For LS, we can specialize the After transformation so as to use unfolding [26] to repeatedly replace every intensional predicate by its definition until only extensional predicates appear in the constraint theory. These replacements may introduce disjunctions and negated conjunctions in the formula. The denial form must then be restored in order for the database language to be closed under the After transformation. For this purpose, we use the UnfoldLS operator below, which is defined in terms of unification and syntactic simplifications based on de Morgan's laws to remove all disjunctions and negated conjunctions.

Definition 3.2.10 (Unfolding) Let S = hIDB, Γi be a database schema in LS. Then UnfoldLS(S) is the schema h∅, Γ′i, where Γ′ is the set of denials obtained by iterating the two following steps as long as possible:

1. replace, in Γ, each occurrence of an atom of the form p(~t) by F^p{X~/~t}, where p ∈ pred(IDB), F^p is p's defining formula and X~ its distinguished variables. If no replacement was made, then stop;

2. transform the resulting formula into a set of denials according to the following patterns:

• ← A ∧ (B1 ∨ B2) is replaced by ← A ∧ B1 and ← A ∧ B2;

• ← A ∧ ¬(B1 ∨ B2) is replaced by ← A ∧ ¬B1 ∧ ¬B2;

• ← A ∧ ¬(B1 ∧ B2) is replaced by ← A ∧ ¬B1 and ← A ∧ ¬B2.

Due to the implicit outermost universal quantification of the variables, non-distinguished variables in a predicate definition are existentially quantified to the right-hand side of the arrow, as shown in example 3.2.11 below. For this reason, with no indication of the quantifiers, the replacements in definition 3.2.10 preserve equivalence iff no predicate containing non-distinguished variables occurs negated in the resulting expression.

Example 3.2.11 Consider S1 from example 3.2.2, which, as shown, is not in LS. With the explicit indication of the quantifiers we have

IDB1 ≡ {∀X(s1(X) ← ∃Y r1(X,Y ))}.

The replacement, in Γ1, of s1(X) by its definition in IDB1 would determine the formula

Γ1′ = {∀X,Y (← p1(X) ∧ ¬r1(X,Y))}.

However, this replacement is not equivalence-preserving, because a predicate (r1) containing a non-distinguished variable occurs negated:

Γ1 ≡ {∀X(← p1(X) ∧ ¬∃Y r1(X,Y))} ≢ Γ1′.

An extension of the language to cases where ¬∃ may occur in denials is discussed in chapter 4.

The language LS is closed under unfolding and UnfoldLS preserves equivalence.

Proposition 3.2.12 Let S = hIDB, Γi ∈ LS. Then UnfoldLS(S) ∈ LS and UnfoldLS(S) ≡ S.

Proof Each iteration in definition 3.2.10 preserves range restriction: if an atom p(~t) occurs negatively, then, by condition 2 in definition 3.2.3, it is replaced by an expression that does not contain any extra (non-distinguished) variable not in ~t; if it occurs positively, then it is replaced by the body of a range restricted rule. The constraint theory in UnfoldLS(S) is a set of denials not containing intensional predicates, and therefore UnfoldLS(S) ∈ LS. Furthermore, each iteration in definition 3.2.10 is equivalence-preserving. In step 1, condition 2 in definition 3.2.3 ensures that each predicate containing non-distinguished variables is preceded by an even number of ¬ signs, and thus not negated; step 2 is a collection of trivial equivalence-preserving rewrites. □

The After transformation can now be refined for LS, by unfolding the resulting theory.

Definition 3.2.13 Let S ∈ LS and U be an update in LS with respect to S. We define AfterLS^U(S) as UnfoldLS(AfterU(S)).

The closure of LS under AfterLS follows immediately from the definitions and proposition 3.2.12 and is, thus, stated without a proof.

Proposition 3.2.14 (Closure of LS under AfterLS) Let S ∈ LS and U be an update in LS with respect to S; then AfterLS^U(S) ∈ LS.

In the following, when IDB is understood or empty, we indicate the constraint theory instead of the schema in AfterLS.

Example 3.2.15 Consider a database containing information about marriages, where the binary predicate m indicates that a husband (first argument) is married to a wife (second argument). We expect for this database updates of the form U = {m(X,Y) ⇐ mU(X,Y)}, where mU is a query defined by the predicate definition mU(X,Y) ← m(X,Y) ∨ (X = a ∧ Y = b), i.e., U is the addition of the tuple ha, bi to m. The following integrity constraint is given:

φ = ← m(X,Y) ∧ m(X,Z) ∧ Y ≠ Z,

meaning that no husband can be married to two different wives. Following definition 3.2.13 in the calculation of AfterLS^U({φ}), first each occurrence of m is replaced by mU, obtaining

← mU(X,Y) ∧ mU(X,Z) ∧ Y ≠ Z.

Then UnfoldLS is applied to this integrity constraint. The first step of definition 3.2.10 generates the following:

{← (m(X,Y) ∨ (X = a ∧ Y = b)) ∧ (m(X,Z) ∨ (X = a ∧ Z = b)) ∧ Y ≠ Z}.

The second step translates it to clausal form:

AfterLS^U({φ}) = { ← m(X,Y) ∧ m(X,Z) ∧ Y ≠ Z,
                   ← m(X,Y) ∧ X = a ∧ Z = b ∧ Y ≠ Z,
                   ← X = a ∧ Y = b ∧ m(X,Z) ∧ Y ≠ Z,
                   ← X = a ∧ Y = b ∧ X = a ∧ Z = b ∧ Y ≠ Z }.

Example 3.2.16 We shall now consider an example of referential integrity, where a relation f (father) is only meaningful if its first argument (the father) is recorded in a relation p (person) with a specific constant value concerning the gender (m for “male”):

φ = ← f(X,Y) ∧ ¬p(X,m).

The following U and IDB indicate an update transaction adding a father/child tuple ha, bi to f and a tuple indicating that a's gender is “male” to p.

U = {f(X,Y) ⇐ fU(X,Y), p(X,Y) ⇐ pU(X,Y)},
IDB = { fU(X,Y) ← f(X,Y) ∨ (X = a ∧ Y = b),
        pU(X,Y) ← p(X,Y) ∨ (X = a ∧ Y = m) }.

We have:

AfterLS^U({φ}) = { ← X = a ∧ Y = b ∧ X ≠ a ∧ ¬p(X,m),
                   ← X = a ∧ Y = b ∧ m ≠ m ∧ ¬p(X,m),
                   ← f(X,Y) ∧ X ≠ a ∧ ¬p(X,m),
                   ← f(X,Y) ∧ m ≠ m ∧ ¬p(X,m) }.
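
As a small executable check of the WP property for example 3.2.15, the following Prolog sketch (the sample constants john, mary and sue and the predicates violates_phi/0 and violates_after/0 are invented for this illustration) evaluates the four denials of AfterLS^U({φ}), instantiated with the parameters a = john and b = sue, against a sample old state.

    :- dynamic m/2.
    m(john, mary).                    % old state, consistent with phi

    % phi = ← m(X,Y) ∧ m(X,Z) ∧ Y ≠ Z, violated in the given state?
    violates_phi :- m(X, Y), m(X, Z), Y \= Z.

    % The four denials of AfterLS^U({phi}) with a = john, b = sue,
    % evaluated against the old state (the last one can never succeed).
    violates_after :- m(X, Y), m(X, Z), Y \= Z.
    violates_after :- m(john, Y), Y \= sue.   % ← m(X,Y) ∧ X=a ∧ Z=b ∧ Y≠Z
    violates_after :- m(john, Z), sue \= Z.   % ← X=a ∧ Y=b ∧ m(X,Z) ∧ Y≠Z
    violates_after :- sue \= sue.             % ← X=a ∧ Y=b ∧ X=a ∧ Z=b ∧ Y≠Z

Here violates_phi fails (the old state satisfies φ), while violates_after succeeds via its second clause, signalling before the update that adding m(john, sue) would give john a second wife and thus violate φ in the updated state.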

3.2.3 Simplification in LS

The theory returned by AfterLS refers to extensional predicates only, so, from now on, the IDB component of a database can be completely disregarded.

Clearly, the result returned by AfterLS may contain redundant denials and sub-formulas (such as, e.g., a = a). Moreover, the fact that the original integrity constraints hold in the current database state can be used to achieve further simplification. For this purpose, we define a transformation OptimizeLS that optimizes a given constraint theory using a set of trusted hypotheses. Typically, the input to OptimizeLS is AfterLS's output theory and the hypotheses are AfterLS's input theory. We describe here an implementation in terms of sound and terminating rewrite rules. An application of a rewrite rule to a constraint theory always produces a theory that has fewer literals¹. OptimizeLS applies the rules as long as possible in order to remove from the input theory all denials and all literals that, with the hypotheses, can be proved to be redundant. Termination is then guaranteed, since the size of the theory is reduced at each rule application and the conditions of applicability of the rules are based on decidable criteria.

Subsumption, reduction and resolution as constraint modification tools

We introduce in this subsection a series of syntactic tools that will be used to compose the optimization step of the simplification procedure.

Definition 3.2.17 (Subsumption) Given two denials D1 and D2, D1 subsumes D2 (with substitution σ) iff there is a substitution σ such that each literal in D1σ occurs in D2. The subsumption is strict if D1 is not a variant of D2.

The subsumption algorithm (see, e.g., [94]), besides checking subsumption, also returns such a substitution σ and, for convenience, with the symbols of the definition above, we say that D1 subsumes D2 with substitution σ.

Example 3.2.18 The denial ← p(X,Y) ∧ q(Y) (strictly) subsumes ← p(X, b) ∧ X ≠ a ∧ q(b) with substitution {Y/b}.
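
A common way to implement the subsumption check of definition 3.2.17 is the following Prolog sketch, in which denials are represented simply as lists of their body literals (a representation assumed only for this illustration): the variables of D2 are frozen with numbervars/3, so that the member/2 calls can only instantiate D1's variables, which yields precisely the substitution σ.

    % subsumes(D1, D2): D1 subsumes D2 in the sense of definition 3.2.17,
    % with denials given as lists of body literals, e.g. [p(X,Y), q(Y)].
    subsumes(D1, D2) :-
        copy_term(D1-D2, C1-C2),   % work on copies, leaving D1 and D2 intact
        numbervars(C2, 0, _),      % freeze D2's variables
        subset_lits(C1, C2).

    subset_lits([], _).
    subset_lits([L|Ls], C2) :-
        member(L, C2),
        subset_lits(Ls, C2).

For instance, subsumes([p(X,Y), q(Y)], [p(Z,b), Z \= a, q(b)]) succeeds, mirroring example 3.2.18; the converse call fails, so the subsumption is strict.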

The definition of subsumption is syntactic but has the semantic property that the subsuming denial implies the subsumed one. We also note that an ordering of denials based on strict subsumption is well-founded, i.e., there is no infinite descending chain φ1, φ2, . . . such that φ1 is strictly subsumed by φ2, φ2 by φ3, and so on, since the initial denial is finite.

The notion of reduction [93] characterizes the elimination of redundancies within a single denial.

¹To be precise, the produced theory is either strictly smaller or its expansion is strictly smaller than the expansion of the original theory. See below in the text and definition 3.2.28.

Definition 3.2.19 (Reduction) For a denial φ, the reduction φ− of φ is the result of applying to φ the following rules as long as possible, where L is a literal, c1, c2 are distinct constants, X a bound variable, t a term, A an atom, C, D (possibly empty) conjunctions of literals, and vars indicates the set of bound variables occurring in its argument.

← L ∧ C ⇒ ← C            if L is of the form t = t or c1 ≠ c2
← L ∧ C ⇒ true           if L is of the form t ≠ t or c1 = c2
← X = t ∧ C ⇒ ← C{X/t}
← A ∧ ¬A ∧ C ⇒ true
← C ∧ D ⇒ ← D            if ← C subsumes ← D with a substitution σ s.t. dom(σ) ∩ vars(D) = ∅
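
For instance (a worked illustration with an invented denial), the denial ← X = a ∧ p(X) ∧ c ≠ d ∧ p(a) reduces to ← p(a) ∧ c ≠ d ∧ p(a) by the third rule, then to ← p(a) ∧ p(a) by the first rule (c and d being distinct constants), and finally to ← p(a) by subsumption factoring; its reduction is therefore ← p(a).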

Clearly, for any denial φ we have φ− ≡ φ and the number of literals in φ− is no greater than the number of literals in φ. Obviously, an ordering of denials based on the number of literals is well-founded, i.e., there is no infinite descending chain φ1, φ2, . . . such that φ1 has more literals than φ2, φ2 has more literals than φ3, and so on, since the initial denial is finite. The last rewrite rule is called subsumption factoring [77] and includes the elimination of duplicate literals from a denial as a special case. We also note that each rule preserves range restriction: in the last rule, a variable occurring in a negative literal in D and in a positive literal in C also occurs in a positive literal in D, otherwise this variable would be in dom(σ).

An additional rule for handling parameters may be considered in the reduction process:

← a = c1 ∧ C ⇒ ← a = c1 ∧ C{a/c1} (3.1)

This may replace parameters with constants and possibly allow further reduction. For example, the denial ← a = c ∧ b = a would be transformed into ← a = c ∧ b = c, and, thus, into true, since b and c are different constants. However, rule 3.1 does not reduce the number of literals, which is a condition for termination that we use in the proofs to follow. To this end, we may assume that rule 3.1 is applied before the other reduction rules and that each equality between a parameter and a constant is only processed once. Although, in most cases, reduction eliminates all redundancies from a denial, this may not happen in the presence of parameters. For example, the denial

← a ≠ b ∧ b = c ∧ c = d ∧ d = a

is clearly a consequence of the free equality axioms, but reduction leaves it unchanged. To detect this redundancy, a transitivity axiom is needed, as will be discussed in section 3.5.2.

We now briefly recall the definition of resolvent and derivation for the well-known principle of resolution in the context of clauses with logical negation (¬`); we refer to [146, 50] for other standard related notions. To stress that the usual context of application of resolution is not that of deductive databases, we first state the following definition for clauses expressed as sets of literals of the form {L1, . . . , Ln}, as shown on page 9.

Definition 3.2.20 Let C1 and C2 be two clauses (called parent clauses) with no variables in common. Let L1 and L2 be two literals in C1 and C2, respectively. If L1 and ¬`L2 have an mgu σ, then the clause (C1σ \{L1σ})∪(C2σ \{L2σ}) is called a binary resolvent of C1 and C2. The literals L1 and L2 are called the literals resolved upon. If two or more literals of a clause C have an mgu σ, then Cσ is called a factor of C. A resolvent of (parent) clauses C1 and C2 is one of the following binary resolvents:

1. a binary resolvent of C1 and C2,

2. a binary resolvent of C1 and a factor of C2,

3. a binary resolvent of a factor of C1 and C2,

4. a binary resolvent of a factor of C1 and a factor of C2.

The resolution principle is a sound inference rule, in that a resolvent is a logical consequence of its parent clauses. In the context of denials with default negation the notion of binary resolvent can be formulated as in definition 3.2.21, where, apart from the different syntactic representation, the ¬ symbol replaces the ¬ℓ symbol. The notions of factor and resolvent for denials with default negation are the same as those for clauses with logical negation.

Definition 3.2.21 Let φ′1 = ← L1 ∧ · · · ∧ Lm and φ′2 = ← M1 ∧ · · · ∧ Mn be two standardized apart variants of denials φ1, φ2. If θ is an mgu of {Li, ¬Mj} then the clause

← (L1 ∧ · · · ∧ Li−1 ∧ Li+1 ∧ · · · ∧ Lm ∧ M1 ∧ · · · ∧ Mj−1 ∧ Mj+1 ∧ · · · ∧ Mn)θ

is called a binary resolvent of φ1 and φ2 and Li, Mj are said to be the literals resolved upon.
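For ground denials no unifier is needed, and definition 3.2.21 amounts to dropping the two complementary literals and joining the remaining bodies. The Python sketch below illustrates exactly this ground case; the tuple-based literal encoding is an assumption of the sketch, and unification would be required in general.

```python
# Ground case of definition 3.2.21 (no unifier needed): a literal is ('m', ('a', 'b'))
# for an atom or ('not', ('m', ('a', 'b'))) for its default negation; a denial is a
# list of literals.  This encoding is an assumption of the sketch.

def complement(lit):
    return lit[1] if lit[0] == 'not' else ('not', lit)

def binary_resolvents(phi1, phi2):
    """All binary resolvents of two ground denials."""
    out = []
    for i, li in enumerate(phi1):
        for j, mj in enumerate(phi2):
            if li == complement(mj):               # Li and Mj are complementary
                out.append(phi1[:i] + phi1[i+1:] + phi2[:j] + phi2[j+1:])
    return out

# <- m(a,b) /\ q(a)  and  <- not m(a,b) /\ r(b)  resolve into  <- q(a) /\ r(b)
phi1 = [('m', ('a', 'b')), ('q', ('a',))]
phi2 = [('not', ('m', ('a', 'b'))), ('r', ('b',))]
print(binary_resolvents(phi1, phi2))
```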

The resolution principle as defined in definition 3.2.20 relies on the fact that the literals resolved upon are “complementary” in the sense of logical negation. In definition 3.2.21 the literals resolved upon are not complementary in the same sense. Therefore we prove below that resolution is sound also in the context of range restricted denials with default negation, i.e., if two denials are satisfied in a given standard model M, then their resolvent is also satisfied in M.

Proposition 3.2.22 Let φ1, φ2 be denials and I an interpretation such that |=I φ1 and |=I φ2. Let φ be a resolvent of φ1 and φ2. Then |=I φ.

Proof Assume that φ1 = ← L1 ∧ · · · ∧ Lm and φ2 = ← M1 ∧ · · · ∧ Mn are standardized 0 0 apart (as φ1 and φ2 in definition 3.2.21) and θ is a mgu of {Li, ¬Mj }. Assume, for now, that φ1 and φ2 are ground. To satisfy the parent clauses, I has to dissatisfy at least one literal in each of them. Let us first consider the case in which I dissatisfies Li. Then it cannot dissatisfy Mj in φ2 (Li = ¬Mj in the ground case) and hence dissatisfies one of the Mk’s, k 6= j. This literal belongs to the resolvent, which is therefore also satisfied. In the other case, I satisfies Li and the proof is symmetric. Suppose now that φ1 and φ2 are not ground. We show that every ground instance of φ holds in I. Let σ be a grounding substitution for φ. Then, by construction, φσ is a resolvent of φ1θσ and φ2θσ. Both φ1θσ and φ2θσ hold in I and both are ground, so the proof reduces to the previous case. To see that they are ground, consider that σ grounds all literals in φ1θ and φ2θ, except perhaps for Liθ and Mj θ. However the denials are range restricted and thus all variables occurring in a negative literal also occur in a positive literal in the body. Suppose Liθ is the negative literal. Then its variables must also occur in other literals in φ1θ; but such literals are all grounded by σ, so Liθ is also grounded by σ. Now, Mjθ has the same variables as Liθ, so it is also grounded by σ. 2

We note that if the parent clauses are range restricted, then the resolvent is also range restricted [162], since the variables that occur in the positive literal resolved upon are unified with the terms in the negative literal resolved upon, which are either constants, parameters or range bound variables.

Definition 3.2.23 (Derivation) Let Γ be a constraint theory and φ a denial. A derivation of φ from Γ is a finite sequence of denials ψ1, . . . , ψk = φ, such that each ψi is either in Γ or is a resolvent of two denials in {ψ1, . . . , ψi−1} and such that no two resolvents are resolvents of the same denials or variants thereof. If such a derivation exists, we write Γ ⊢_r φ. A derivation of the empty clause from Γ is called a refutation of Γ.

Definition 3.2.24 (Deduction) Let Γ be a constraint theory and φ a denial. There is a deduction of φ from Γ, written Γ ⊢_d φ, if there exists a denial ψ such that Γ ⊢_r ψ and ψ subsumes φ.

We also refer to the notion of expansion [48]: the expansion of a clause consists in replacing every constant or parameter in a database predicate (or variable already appearing elsewhere in database predicates) by a new variable and adding the equality between the new variable and the replaced item. We indicate the expansion of a (set of) denial(s) with a “+” superscript.
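As an illustration of the expansion operator, the following Python sketch replaces constants and repeated variables inside database atoms by fresh variables V1, V2, . . . together with the corresponding equalities. The encoding, the treatment of parameters as constants and the naming scheme are assumptions of this sketch; example 3.2.25 below shows the same computation by hand.

```python
# Expansion of a denial (the "+" operator): constants and repeated variables inside
# database atoms are replaced by fresh variables plus the corresponding equalities.

def is_var(t):
    return t[:1].isupper()

def expand(denial):
    """denial: list of ('atom', pred, [terms]) and ('=', s, t) / ('!=', s, t) literals."""
    count, seen, atoms, eqs = 0, set(), [], []
    for lit in denial:
        if lit[0] != 'atom':
            atoms.append(lit)
            continue
        new_args = []
        for t in lit[2]:
            if is_var(t) and t not in seen:        # first occurrence of a variable: keep it
                seen.add(t)
                new_args.append(t)
            else:                                  # constant, parameter or repeated variable
                count += 1
                v = f"V{count}"
                eqs.append(('=', v, t))
                new_args.append(v)
        atoms.append(('atom', lit[1], new_args))
    return atoms + eqs

# phi = <- p(X, a, X)   expands to   <- p(X, V1, V2) /\ V1 = a /\ V2 = X
print(expand([('atom', 'p', ['X', 'a', 'X'])]))
```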

Example 3.2.25 Let φ = ← p(X, a, X). Then φ^+ = ← p(X, Y, Z) ∧ Y = a ∧ Z = X.

We now describe a resolution-based procedure that limits the size of each resolvent to the size of the biggest denial in Γ.

Definition 3.2.26 For a constraint theory Γ in L_S and a denial φ in L_S, the notation Γ ⊢_R φ indicates that there is a resolution derivation of a denial ψ from Γ^+ such that in each resolution step the resolvent has at most n literals and ψ^- subsumes φ, where n is the number of literals of the largest denial in Γ^+.

The advantages and disadvantages of such a limit on the size of the resolvents in the resolution proofs will be discussed in section 3.5.2. Informally, the motivation for this choice is that the denials whose size is less than n are the only usable clauses for subsequent elimination of literals by subsumption (and reduction).

Proposition 3.2.27 ⊢_R is sound and terminates on any input.

Proof Soundness follows from soundness of resolution and subsumption. Termination follows from the fact that the set of denials in Γ^+ is finite, so there is only a finite number of clauses of at most n literals that can be generated by resolution steps in Γ^+, as Γ^+ is function-free. □

Optimization of integrity constraints

The tools introduced in the previous section can be used to compose a procedure that eliminates redundant literals and denials from a given constraint theory Γ assuming that another theory ∆ holds. Definition 3.2.28 below introduces an operator that allows us to do this. Informally, ⊢_R, subsumption and reduction are used here to approximate entailment; a discussion on how good an approximation is achieved will be provided in section 3.5.2. In the following, the notation A ⊔ B indicates union of disjoint sets, i.e., it denotes A ∪ B and indicates that A ∩ B = ∅.

Definition 3.2.28 Given two constraint theories ∆ and Γ in L_S, Optimize^∆_LS(Γ) is the result of applying the following rewrite rules on Γ as long as possible. In the following, φ, ψ are denials in L_S, Γ′ is a constraint theory in L_S.

{φ} ⊔ Γ′ ⇒ Γ′            if φ^- = true
{φ} ⊔ Γ′ ⇒ Γ′            if (Γ′ ∪ ∆) ⊢_R φ
{φ} ⊔ Γ′ ⇒ {φ^-} ∪ Γ′    if φ ≠ φ^- ≠ true
{φ} ⊔ Γ′ ⇒ {ψ^-} ∪ Γ′    if ({φ} ⊔ Γ′ ∪ ∆) ⊢_R ψ and ψ^- strictly subsumes φ

The first two rules attempt the elimination of a whole denial, whereas the last two try to remove literals from a denial2. Notice that if the empty clause is produced during the process, then OptimizeLS returns false, as it subsumes every other denial.
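The rewrite loop of definition 3.2.28 can be pictured as the following Python skeleton, in which the three logical ingredients—reduction, the bounded resolution test ⊢_R and the search for a strictly subsuming replacement—are passed in as oracle functions rather than implemented; the fixed left-to-right rule order is likewise an arbitrary choice of this sketch, since the definition only requires that the rules be applied as long as possible.

```python
# Skeleton of the rewrite loop of definition 3.2.28; the three oracles are assumptions
# of this sketch, not implementations.

TRUE = 'true'   # sentinel for a denial that reduction has shown to be trivially satisfied

def optimize(gamma, delta, reduce_denial, derivable, find_strict_improvement):
    gamma = list(gamma)
    changed = True
    while changed:
        changed = False
        for i, phi in enumerate(gamma):
            rest = gamma[:i] + gamma[i+1:]
            r = reduce_denial(phi)
            if r == TRUE:                          # rule 1: phi reduces to true
                gamma = rest; changed = True; break
            if derivable(rest + delta, phi):       # rule 2: phi is redundant given the rest and delta
                gamma = rest; changed = True; break
            if r != phi:                           # rule 3: replace phi by its reduction
                gamma = rest + [r]; changed = True; break
            psi = find_strict_improvement(gamma + delta, phi)
            if psi is not None:                    # rule 4: replace phi by a strictly subsuming denial
                gamma = rest + [psi]; changed = True; break
    return gamma

# Toy run, with denials as plain strings and ad hoc stub oracles: the derivability stub
# simply declares one variant redundant, mimicking the subsumption step of example 3.2.32 below.
redundant = {'<- m(a,Z) & b != Z'}
print(optimize(gamma=['<- m(a,Y) & Y != b', '<- m(a,Z) & b != Z'],
               delta=[],
               reduce_denial=lambda d: d,
               derivable=lambda theory, d: d in redundant,
               find_strict_improvement=lambda theory, d: None))
# -> ['<- m(a,Y) & Y != b']
```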

Proposition 3.2.29 For any two constraint theories Γ and ∆ in L_S, the execution of Optimize^∆_LS(Γ) terminates. Furthermore the result is in L_S and Γ ≡^∆ Optimize^∆_LS(Γ).

Proof Each condition of applicability in the rules of definition 3.2.28 is decidable, as subsumption, reduction and ⊢_R are. Termination follows from the fact that each rule in the procedure either reduces the number of literals in the constraint theory (first three rules) or replaces a denial with one that strictly subsumes it (last rule), which is obviously a well-founded ordering. Every rule preserves range restriction: the first two rules eliminate a whole denial; reduction preserves range restriction and so does each resolution step. Denial form is also preserved, and therefore the result is in L_S. By soundness of subsumption, reduction and ⊢_R, all rules preserve ∆-conditional equivalence. □

The simplification transformation for LS is straightforwardly defined in terms of AfterLS and OptimizeLS .

Definition 3.2.30 Let S = ⟨IDB, Γ⟩ ∈ L_S and U be an update in L_S with respect to S. Let Unfold_LS(S) = ⟨∅, Γ′⟩. We define

Simp^U_LS(S) = Optimize^{Γ′}_LS(After^U_LS(S)).
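Operationally, this is just a composition of the three phases, with the unfolded constraint theory Γ′ playing the role of the trusted hypotheses in Optimize. A minimal sketch, with the three phases passed in as functions whose implementations are not shown here:

```python
# Sketch of the composition in definition 3.2.30; `unfold`, `after` and `optimize`
# are assumed to be supplied and stand for Unfold_LS, After^U_LS and Optimize with
# the unfolded theory Gamma' as trusted hypotheses.

def simp(schema, update, unfold, after, optimize):
    _, gamma_prime = unfold(schema)                    # Unfold_LS(S) = <empty IDB, Gamma'>
    return optimize(after(schema, update), gamma_prime)
```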

From the previous results we get immediately the following.

2The last rule does not necessarily reduce the number of literals, but note that the expansion of the replacing denial will anyhow contain fewer literals than the expansion of the original one.

Proposition 3.2.31 Let S ∈ L_S and U be an update in L_S with respect to S. Then Simp^U_LS(S) is in L_S and is a CWP of S with respect to U.

Example 3.2.32 Consider again the update and the constraint theory from example 3.2.15 on page 36, where we showed the transformation After^U_LS({φ}). In order to obtain Simp^U_LS({φ}), we apply the rewrite rules of Optimize_LS as follows. The reduction of each denial in After^U_LS({φ}) generates the following set.

{ ← m(X, Y) ∧ m(X, Z) ∧ Y ≠ Z, ← m(a, Y) ∧ Y ≠ b, ← m(a, Z) ∧ b ≠ Z }.

Then, the third denial is removed, as it is subsumed by the second one3. Similarly, the first constraint is subsumed by φ and, thus, removed.

Simp^U_LS({φ}) = {← m(a, Y) ∧ Y ≠ b}.

This result indicates that, for the database to be consistent after update U, husband a must not be already married to a wife Y different from b.

Example 3.2.33 We continue example 3.2.32 and suppose now that no duplicate entries are allowed in the database, i.e., ∆ = {← m(a, b)} is a trusted set of hypotheses when update U is made. Then, using ∆, we can achieve further simplification, since ∆^+ = {← m(X, Y) ∧ X = a ∧ Y = b} and with a resolution step with the expansion of the previous result, i.e., {← m(X, Y) ∧ X = a ∧ Y ≠ b}, we obtain

Optimize^∆_LS(Simp^U_LS({φ})) = {← m(a, Y)},

i.e., now we only need to check that a is not already married.

Example 3.2.34 Reconsider now the update and the constraint theory from example 3.2.16 on page 36. We have here:

Simp^U_LS({φ}) = ∅.

This indicates that database integrity cannot be violated by this update.

In the following example we show different combinations of tuple updates for predicates occurring in positive and negative literals. In order to simplify the notation for tuple additions and deletions, we write p(~a) as a shorthand for the database update p(~X) ⇐ p(~X) ∨ ~X = ~a and ¬p(~a) for p(~X) ⇐ p(~X) ∧ ~X ≠ ~a.

Example 3.2.35 Consider the integrity constraint φ = ← p(X) ∧ q(X) ∧ ¬r(X), parameters a and b and constants a and b. The following transformations show how different instances of the parameters may modify the result of a parametric simplification.

3Alternatively, the second constraint could be removed instead of the third one, as they subsume one another.


Simp^{p(a), r(b)}_LS({φ}) = {← q(a) ∧ ¬r(a) ∧ a ≠ b}

Simp^{p(a), r(b)}_LS({φ}) = {← q(a) ∧ ¬r(a)}

Simp^{p(a), r(a)}_LS({φ}) = ∅

Simp^{p(a), ¬r(b)}_LS({φ}) = { ← q(a) ∧ ¬r(a), ← p(b) ∧ q(b), ← a = b ∧ q(a) }
Simp^{p(a), ¬r(b)}_LS({φ}) = { ← q(a) ∧ ¬r(a), ← p(b) ∧ q(b) }
Simp^{p(a), ¬r(a)}_LS({φ}) = {← q(a)}

Next we show an example of update in which the tuples to be modified are not mentioned explicitly, but rather specified intensionally with a complex defining formula.

Example 3.2.36 Consider a database of employees e(EMP, DEPT) (an employee works in a department), and skills s(EMP, SKILL) (an employee has a skill). It is required that the computer science personnel (cs) has programming skills (pr): φ = ← e(X, cs) ∧ ¬s(X, pr). The following update indicates that all employees of department a migrate to department b: U = {e(X, Y) ⇐ (e(X, Y) ∧ Y ≠ a) ∨ (e(X, a) ∧ Y = b)}. We have:

After^U_LS({φ}) ≡ { ← ((e(X, cs) ∧ cs ≠ a) ∨ (e(X, a) ∧ cs = b)) ∧ ¬s(X, pr) }
               ≡ { ← e(X, cs) ∧ cs ≠ a ∧ ¬s(X, pr), ← e(X, a) ∧ cs = b ∧ ¬s(X, pr) }
Simp^U_LS({φ}) = { ← e(X, a) ∧ cs = b ∧ ¬s(X, pr) }.

So, if b is computer science, each employee in department a must have programming skills.

Example 3.2.37 Consider the following constraint theories.

Γ1 = { ← ¬p(X) ∧ q(X) ∧ r(X), ← p(X) ∧ ¬q(X), ← p(X) ∧ ¬r(X) },
Γ2 = { ← s(X) ∧ q(X) ∧ r(X) },
Γ3 = { ← s(X) ∧ p(X) }.

Let Γ1,2 = Γ1 ∪ Γ2 and Γ1,3 = Γ1 ∪ Γ3. For the update U = {p(a)}, the theory obtained in the calculation of Simp^U_LS(Γ1,3) after all possible reduction and subsumption steps in

OptimizeLS is as follows: {← s(a), ← ¬q(a), ← ¬r(a)}.

The denial ← s(a) can be eliminated, as Γ1,3 ∪ {← ¬q(a), ← ¬r(a)} ⊢_R (← s(a)). We have:

Simp^U_LS(Γ1,3) = {← ¬q(a), ← ¬r(a)}.

Note that Γ1,2 ≡ Γ1,3 and Simp^U_LS(Γ1,2) = Simp^U_LS(Γ1,3); however, in the calculation of Simp^U_LS(Γ1,2), subsumption and reduction are sufficient to obtain the same (minimal) result.

In examples 3.2.32–3.2.37, it appears that the simplified integrity constraints are minimal in the number of literals and instantiated as much as possible. In examples 3.2.35 and 3.2.36, some parametric simplifications are minimal but could become smaller once the parameters are instantiated, as was shown in example 3.1.8 on page 28.

3.2.4 On pre-tests and post-tests

In section 2.4.2 we introduced the notions of post- and pre-test. Now that we have tools for effectively generating pre-tests, we can take a closer look at the relationship between pre-tests and post-tests. First, we show with a counter-example that, in general, a pre-test (resp. post-test) cannot be used as a post-test (resp. pre-test).

Example 3.2.38 Consider the constraint theory Γ = {← p(a) ∧ q(b)} and the update U = {p(X) ⇐ q(X), q(X) ⇐ p(X)} that exchanges p and q. Then Σ1 = {← q(a) ∧ p(b)} is a pre-test (returned by Simp_LS) but clearly it is not a correct post-test. Consider, e.g., a database D = {p(a), p(b), q(a)}; we have D |= Γ, D^U ⊭ Γ, D^U |= Σ1, i.e., D^U |= Σ1 ⇎ D^U |= Γ, although D is consistent with Γ. Similarly, Σ2 = {← p(a) ∧ q(b)} is a post-test, but not a pre-test.

We now introduce a restricted class of updates that excludes an update such as U of example 3.2.38.

Definition 3.2.39 An update U is idempotent if D^U = (D^U)^U for any database D.
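Idempotence can be checked exhaustively on toy instances. The sketch below treats databases as finite sets of ground atoms over a small fixed universe and updates as Python functions on such sets—a drastic simplification of the update language of the thesis, used here only to illustrate definition 3.2.39 on a tuple addition and on the swap update of example 3.2.38.

```python
# Brute-force illustration of definition 3.2.39 on toy instances; the encoding of
# databases and updates is an assumption of this sketch.

from itertools import combinations

UNIVERSE = [('p', 'a'), ('p', 'b'), ('q', 'a'), ('q', 'b')]

def all_databases(universe):
    return [set(c) for r in range(len(universe) + 1)
                   for c in combinations(universe, r)]

def is_idempotent(update):
    """Check that applying the update twice equals applying it once, for every toy database."""
    return all(update(update(d)) == update(d) for d in all_databases(UNIVERSE))

add_pa = lambda d: d | {('p', 'a')}                                  # tuple addition p(a)
swap   = lambda d: {('q' if r == 'p' else 'p', x) for (r, x) in d}   # exchange p and q, as in example 3.2.38

print(is_idempotent(add_pa))   # True: adding p(a) twice equals adding it once
print(is_idempotent(swap))     # False: swapping twice restores the original state
```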

For idempotent updates, we can prove that a WP is always a valid post-test.

Proposition 3.2.40 Let Γ be a constraint theory and U an idempotent update. Let Σ be a WP of Γ with respect to U. Then D^U |= Σ ⇔ D^U |= Γ for any database D, i.e., Σ is a post-test of Γ with respect to U.

Proof Since U is idempotent, i.e., D^U = (D^U)^U for any D, we have

(1) D^U |= Γ ⇔ (D^U)^U |= Γ for any D.

Since Σ is a WP of Γ wrt. U, we have

(2) D^U |= Σ ⇔ (D^U)^U |= Γ for any D.

By transitivity between (1) and (2) we obtain the thesis. □

It is an open question whether, for idempotent updates, any CWP is also a valid post-test. We conjecture that this claim holds, but we were neither able to prove it nor to disprove it.

Conjecture 3.2.41 Let Γ be a constraint theory and U an idempotent update. Let Σ be a CWP of Γ with respect to U. Then D^U |= Σ ⇔ D^U |= Γ for any database D consistent with Γ, i.e., Σ is a post-test of Γ with respect to U.

We can easily prove that, if D is a consistent state, there is no update U such that D |= Σ, D^U ⊭ Σ, and D^U |= Γ. However, we were not able to exclude the existence of an update U such that D ⊭ Σ, D^U |= Σ, and D^U ⊭ Γ. On the other hand, we can show with a counter-example that the reverse does not hold, i.e., for idempotent updates, there are post-tests that are not valid pre-tests.

Example 3.2.42 Consider the constraint theory Γ = {← p(X, Y) ∧ p(Y, X)} and the update U = {p(a, b)}. Clearly, the theory Σ = {← p(b, a)} is a post-test, but not a correct pre-test. Suppose, e.g., that a = b = c; then Σ can succeed even though the updated state is necessarily inconsistent. This is due to the presence of parameters. Note that the theory returned by Simp^U_LS(Γ), which is a pre-test, is Σ′ = {← p(b, a), ← b = a}, which is also a valid post-test.

To substantiate with empirical evidence the claim of conjecture 3.2.41 we used a generator of random constraint theories and updates of the following form

{p1(~c1), . . . , pn(~cn), ¬q1(~d1), . . . , ¬qm(~dm)},        (3.2)

where p1, . . . , pn, q1, . . . , qm are (not necessarily distinct) predicates, ~c1, . . . , ~cn, ~d1, . . . , ~dm are vectors of constants, m + n > 0, and there are no i, j such that pi(~ci) = qj(~dj). Our tests showed that all simplifications produced by Simp_LS in these cases were also valid post-tests. This would suggest that Simp_LS can also be used to produce post-tests for updates of the form (3.2), a claim that follows directly from conjecture 3.2.41.

Another interesting aspect regarding pre-tests and post-tests is whether their evaluation is at all affected by the update. Proposition 3.2.40 immediately implies that the evaluation of a WP is not affected by the update, as stated in the following corollary.

Corollary 3.2.43 Let Σ be a WP of a constraint theory Γ with respect to an idempotent update U. Then D |= Σ ⇔ D^U |= Σ for any D.

Proof Since Σ is a WP of Γ wrt. U, we have D |= Σ ⇔ D^U |= Γ for any D, and, by transitivity with the claim of proposition 3.2.40, we have the thesis. □

This does not hold in general for CWPs, as demonstrated in the following counter-example, i.e., the evaluation of a CWP may be affected by the update.

Example 3.2.44 Consider a constraint theory Γ = {← p(X) ∧ q(X) ∧ r(X), ← ¬q(X) ∧ r(X)} and an update U = {p(a)}. A CWP of Γ wrt. U is Σ = {← q(a) ∧ r(a), ← p(X) ∧ r(X)}, which is obtained from the WP {← p(X) ∧ q(X) ∧ r(X), ← q(a) ∧ r(a), ← ¬q(X) ∧ r(X)} by removing the literal q(X) in the first denial by a resolution step with the last denial, and then by removing the last denial (which also occurs in Γ). The atom p(a) may affect the evaluation of Σ. Consider, e.g., D = {r(a)}; then D |= Σ but D^U ⊭ Σ.

However, example 3.2.44 is not a counter-example for the claim of conjecture 3.2.41, since the initial state D is not consistent with Γ. The following table summarizes the results and open problems presented in this subsection. We indicate with pre_U(Γ) a pre-test of constraint theory Γ with respect to U, with post_U(Γ) a post-test of Γ with respect to U, and with WP_U(Γ) a WP of Γ with respect to U. We enclose these notations between curly brackets to indicate the set of all such constraint theories. Known results are stated plainly, whereas conjectured entries are followed by a question mark.

for any Γ and for any D                       any U    idempotent U    U of form (3.2)
{pre_U(Γ)} ⊆ {post_U(Γ)}?                     no       yes?            yes?
{post_U(Γ)} ⊆ {pre_U(Γ)}?                     no       no              ?
{WP_U(Γ)} ⊆ {post_U(Γ)}?                      no       yes             yes
D |= pre_U(Γ) ⇔ D^U |= pre_U(Γ)?              no       no              no
D |= WP_U(Γ) ⇔ D^U |= WP_U(Γ)?                no       yes             yes

3.3 On the equivalence of ideal simplification and query containment

There is a direct correspondence between the problem of ideal simplification (definition 3.1.6 on page 27) and query containment (definition 2.2.23 on page 18). In order to characterize this relationship, we introduce the following lemma.

Lemma 3.3.1 Consider the following constraint theories

Γ = {← A1, . . . , ← An},
Γ1 = Γ ∪ {← p(~X), ← q(~X)},
Γ2 = Γ ∪ {← q(~X)},
Γ^r = {← A1 ∧ r, . . . , ← An ∧ r},
Γ^r_1 = Γ^r ∪ {← p(~X) ∧ r, ← q(~X) ∧ r},
Γ^r_2 = Γ^r ∪ {← q(~X) ∧ r}

and the schemata

S1 = ⟨I, Γ1⟩, S2 = ⟨I, Γ2⟩, S^r_1 = ⟨I, Γ^r_1⟩, S^r_2 = ⟨I, Γ^r_2⟩,

where ⇐ p(~X) and ⇐ q(~X) are queries, r is a nullary predicate not occurring in S1 or S2 and the Ai's are conjunctions of literals. Assume Simp is an ideal simplification procedure, let U be the update {r ⇐ true} and let Ti = ⟨Ii, Σi⟩ = Simp^U(S^r_i) for i = 1, 2. Then T1 = T2 iff S1 ≡ S2.

Proof Preliminary considerations. Ti is a CWP of S^r_i with respect to U for i = 1, 2, i.e.,

(1) D |= Σi iff D^U |= Γ^r_i for any database D based on Ii ∪ I and consistent with Γ^r_i.

If part. If r is false then Γ^r_1 and Γ^r_2 are both true; if r is true then S^r_1 ≡ S1 and S^r_2 ≡ S2 (and S1 ≡ S2 by hypothesis). Therefore S^r_1 ≡ S^r_2, i.e.,

(2) D |= Γ^r_1 iff D |= Γ^r_2 for any D based on I.

By transitivity, we obtain from (1) and (2) that T1 is also a CWP of S^r_2 wrt U (and, similarly, T2 is also a CWP of S^r_1 wrt U). Assume now, by contradiction, that T1 ≠ T2. If, for example, T1 ≺ T2, then T2 would not be an ideal simplification of S^r_2 wrt U, against the hypothesis that Simp was ideal.

Only if part. Assume, by contradiction, that S1 ≢ S2. Then there exists a state D′ such that

(3) D′ ⊭ Γ1 and D′ |= Γ2.⁴

Consider a state D′′ = D′ \ {r}. Both Γ^r_1 and Γ^r_2 hold in D′′ (since r is false in it) and we have D′′^U = D′. But since Ti, for i = 1, 2, is a CWP of S^r_i wrt U and D′′ |= Γ^r_i, we have, from (1),

(4) D′′ |= Σi iff D′ |= Γ^r_i, and, since r ∈ D′, we have

(5) D′ |= Γ^r_i iff D′ |= Γi.

Now, using (3), (4) and (5), we obtain D′′ ⊭ Σ1. Similarly, we obtain D′′ |= Σ2. But this is a contradiction, since Σ1 = Σ2 by hypothesis. □

Lemma 3.3.2 Let Γ, Γ1, Γ2, S1, S2 be as in lemma 3.3.1 and consider a schema S = ⟨I, Γ⟩. Then S : p ⊆ q iff S1 ≡ S2.

Proof By definition of answer and QC, the problem S : p ⊆ q is equivalent to the following

(1) for any ground tuple ~a, D |= p(~a) implies D |= q(~a), in any database D based on I and consistent with Γ.

Similarly, the equivalence between S1 and S2 indicates that,

(2) for any ground tuple ~a, D |= (Γ ∧ ← p(~a) ∧ ← q(~a)) iff D |= (Γ ∧ ← q(~a)), in any database D based on I.

Clearly, (2) always holds if D ⊭ Γ. If D |= Γ, (2) amounts to the following.

(3) for any ground tuple ~a, D |= (← p(~a) ∧ ← q(~a)) iff D |= (← q(~a)), in any database D based on I.

⁴We can already exclude the case where D′ |= Γ1 and D′ ⊭ Γ2, since Γ1 |= Γ2 by construction.

i.e., (2) and (3) are equivalent. But also (1) and (3) are equivalent, as can be seen by noticing that the truth tables of p → q and (¬p ∧ ¬q) ↔ ¬q are identical. Therefore (1) and (2) are equivalent. □

We can now state the main theorem of this section, which shows the correspondence between query containment and ideal simplification. We implicitly assumed so far that nullary predicates are available in the predicate language at hand.⁵ In theorem 3.3.3 we will also assume that, in any given language L, whenever some schema has an integrity constraint of the form ← A, then the schema with IDB extended with p_A ← A is also in L. Although, strictly speaking, this is not true of all database languages, it is convenient to assume that any reasonably expressive database language enjoys these properties. The theorem below refers to such database languages.

Theorem 3.3.3 For any database language L, QC is decidable in L if and only if L admits an ideal simplification procedure.

Proof If part. Assume an ideal simplification procedure Simp for L and consider the QC problem S : p ⊆ q over a schema S = ⟨I, Γ⟩. Consider the schemata and constraint theories defined in lemma 3.3.1. By lemma 3.3.2, S : p ⊆ q iff S1 ≡ S2. But the last equivalence can be decided with an ideal simplification procedure, as shown in lemma 3.3.1, and therefore so can the QC problem.

Only if part. For simplicity, we start by considering the case with parameter-free updates. Let SA = ⟨IDB_A, Γ_A⟩ = After^U(S) for some parameter-free update U. By proposition 3.2.9, SA is a CWP of S = ⟨IDB, Γ⟩ wrt U. Let SB = ⟨IDB_B, Γ_B⟩ be any schema such that SB ⪯ SA.⁶ By definition 3.1.2 (of CWP) and transitivity of ⇔, SB is a CWP of S wrt U iff the following holds:

D |= Γ_A iff D |= Γ_B for every D based on IDB_A ∪ IDB_B such that D |= Γ.

This still holds after adding the following IDB_{qA} to IDB_A and IDB_{qB} to IDB_B:

IDB_{qA} = {q^A ← φ1, · · · , q^A ← φn},
IDB_{qB} = {q^B ← ψ1, · · · , q^B ← ψm},

where Γ_A = {← φ1, · · · , ← φn}, Γ_B = {← ψ1, · · · , ← ψm} and q^A, q^B are nullary predicates not occurring in SA or SB. These definitions state that q^A holds iff Γ_A does not hold, and similarly for q^B and Γ_B. Let IDB′ = IDB_A ∪ IDB_B ∪ IDB_{qA} ∪ IDB_{qB}; then SB is a CWP of S wrt U iff

D |= Γ_A iff D |= Γ_B for every D based on IDB′ such that D |= Γ.

But the if and, respectively, only-if parts of this condition are the query containment problems S′ : q^A ⊆ q^B and S′ : q^B ⊆ q^A for the schema S′ = ⟨IDB′, Γ⟩. By properties 2 and 4 of definition 3.1.5 we can enumerate all schemata SB such that SB ⪯ SA and test

⁵Alternatively, we could have used (non-nullary) ground atoms instead of nullary predicates.
⁶We can assume without loss of generality that SB is compatible with SA, for, if it is not, the predicates in pred(IDB_B \ IDB_A) can be renamed.

the two conditions above by query containment. This process is then repeated until the minimal CWP is found. This provides an ideal simplification procedure for the parameter-free case.

Suppose now that U contains n different parameters. The construction of IDB′ is done as in the non-parametric case, but this time IDB′ also contains parameters. The two containment problems above can be decided for every instance of these parameters: the two QC relationships above hold for the parametric case iff they do for every instance of the parameters. Furthermore, we only need to consider a finite number of instances. To see this, let C be a set containing all constants occurring in IDB′ plus n (fixed) constants not occurring in IDB′. We only need to consider all possible parametric instances mapping the parameters in IDB′ to constants in C. Considering other mappings (to constants not in C) amounts to reformulating the same containment problem modulo renaming of symbols. Since C is finite, the described procedure is terminating. □

Query containment has been studied extensively in the literature, including identification of a number of decidable sub-cases. However, the sort of procedure indicated in the proof is unlikely to be applicable in practice, since many CWP candidates will be generated in a typical case. A different strategy to achieve ideal simplifications is described in section 3.4.2 below. It follows immediately from proposition 3.1.12 on page 29, in which simplification was obtained in terms of optimization, and the if part of theorem 3.3.3 that no ideal optimization procedure exists for a language for which query containment is undecidable. Analogously to the only-if part of theorem 3.3.3, it follows that an optimization procedure can be constructed from a query containment decision procedure (if it exists) using enumeration.

Theorem 3.3.4 There exists an ideal optimization procedure for a language L if and only if query containment is decidable in L.

Proof The only-if part holds as, if there is an ideal optimization procedure, then, by proposition 3.1.12, there also exists an ideal simplification procedure; then, by theorem 3.3.3, query containment must be decidable in L. For the other part, it suffices to note that, given the schema S to be optimized with respect to schema S_∆, we can enumerate all schemata S′ such that S′ ≺ S, decide S′ ≡^{S_∆} S by query containment as in the proof of theorem 3.3.3 and then select a minimal one among them. □

3.4 Ordering and minimality

3.4.1 Ordering and efficiency

In order to characterize a transformed integrity constraint as an optimal simplification, it must represent a minimum in some ordering that reflects the effort of actually evaluating it. This can only be an estimate, as the actual execution times depend on the database state, which is not available at the time of the simplification process. Furthermore, it

is highly dependent on the applied database technology, which may perform optimizations that cannot be included in a general definition. We restrict our discourse in this section to constraint theories in which only extensional predicates may occur, i.e., schemata with empty IDB, as was the case for unfolded theories in L_S. To find an optimal simplification means then to choose among all the conditionally equivalent CWPs according to an optimality criterion. Several different criteria can be defined.

A natural choice is a syntactic order based on the number of literals: the optimal theories are those in which this number is minimal (and when the number of literals is the same, another standard ordering, such as the alphabetical ordering, is used). This ordering, indicated as ⪯_ℓ, may appear a bit coarse, as the number of literals in, say, ← 1 = 2, ← p(a), and ← p(X) is the same. However, it should be kept in mind that it applies within a class of constraint theories that are pairwise conditionally equivalent (with respect to a given theory).

Another possibility is to define a semantic order based on the weakness of a theory. We say that a theory Γ1 is weaker than a theory Γ2 if D |= Γ1 for all database states D for which D |= Γ2 (and Γ2 is said to be stronger than Γ1). The interest of a weakest constraint theory is that it typically contains fewer constraints to check than a stronger theory (adding constraints means strengthening the theory) and a weaker constraint is more likely to evaluate to true than a stronger one, and therefore may be checked faster with an optimistic approach. We indicate this ordering as ⪯_w. Given two constraint theories Γ and ∆, it trivially follows from the definition of conditional equivalence (definition 3.1.10) that the strongest theory Σ, such that Σ ≡^∆ Γ, is Γ ∪ ∆; similarly, the weakest one is a theory equivalent to Γ ∨ ¬∆, provided that this formula can be expressed as a set of denials. However, testing whether, for two given theories Γ and Σ, we have Γ ⪯_w Σ corresponds to checking entailment, which is undecidable in general (see, e.g., [3]); therefore ⪯_w is not an enumerative ordering according to the second point of definition 3.1.5.

Another interesting ordering is based on the notion of resource set, i.e., the set of extensional atoms that affect the evaluation of a given constraint theory: the smaller the resource set, the better the CWP.⁷ This notion is formalized below.

Definition 3.4.1 (Uncovered set) Let S = ⟨IDB, Γ⟩ be a schema and R a set of extensional atoms. Then R is an uncovered set for Γ whenever

D |= Γ ⇔ D′ |= Γ

for any two databases D, D′ based on IDB with extensional parts, resp., E, E′ such that E ∩ R = E′ ∩ R. If, furthermore, Γ admits no other uncovered set R′ ⊂ R, R is a minimal uncovered set for Γ.
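For a ground, IDB-free toy setting, definition 3.4.1 can be checked by brute force: a constraint theory is modelled as a Boolean function over finite sets of ground atoms drawn from a small fixed base. The Python sketch below enumerates the definition directly and, exploiting the uniqueness stated in proposition 3.4.2 below, extracts the resource set as the smallest uncovered set; the encoding and the four-atom base are assumptions of this illustration only, and example 3.4.3 works out the same constraint over the full Herbrand base.

```python
# Brute-force illustration of definition 3.4.1 in a tiny, ground, IDB-free setting.

from itertools import combinations

BASE = [('p', 'a'), ('p', 'b'), ('q', 'a'), ('q', 'b')]

def subsets(xs):
    return [set(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

def is_uncovered(R, gamma):
    """R is an uncovered set for gamma iff gamma only depends on the atoms in R."""
    return all(gamma(E1) == gamma(E2)
               for E1 in subsets(BASE) for E2 in subsets(BASE)
               if E1 & set(R) == E2 & set(R))

def resource_set(gamma):
    """The smallest uncovered set (unique by proposition 3.4.2 below), by brute force."""
    return min((R for R in subsets(BASE) if is_uncovered(R, gamma)), key=len)

# Gamma = { <- p(X) /\ X != a }: satisfied iff no p-atom other than p(a) is present.
gamma = lambda E: all(atom == ('p', 'a') for atom in E if atom[0] == 'p')
print(resource_set(gamma))     # {('p', 'b')}: within this base, P \ {p(a)} = {p(b)}
```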

Proposition 3.4.2 For any constraint theory Γ there exists a unique minimal uncovered set.

The minimal uncovered set of a constraint theory Γ is called the resource set of Γ and is indicated as R(Γ).

7In [59] the notion of “cover” was used instead (the bigger the cover, the better), as opposed to the dual notion of uncovered set which is used here.

Example 3.4.3 Let Γ = {← p(X) ∧ X ≠ a} be a constraint theory, p an extensional predicate, and P the set of all ground atoms constructible with predicate p. The resource set of Γ is the set P \ {p(a)}. Any set of ground extensional atoms that is a (strict) superset of P \ {p(a)} is a (non-minimal) uncovered set for Γ. Consider now Σ = {← p(X) ∧ ¬q(X), ← p(a)} and let Q be the set of all ground atoms constructible with predicate q. The resource set of Σ is the set (P ∪ Q) \ {q(a)}, since the evaluation of Σ is not affected by the presence of q(a), whereas it is affected by all other tuples in Q and all tuples in P.

In order to prove proposition 3.4.2, we introduce lemma 3.4.4 below.

Lemma 3.4.4 Consider a schema S = ⟨IDB, Γ⟩. A set of extensional atoms R is an uncovered set of Γ iff Γ is indifferent to any single extensional atom A not in R, i.e., iff

⟨IDB, (F \ {A})⟩ |= Γ iff ⟨IDB, (F ∪ {A})⟩ |= Γ

for any extensional database F and for any atom A ∉ R.

Proof Only-if part. Assume R is an uncovered set of Γ and consider any two sets of extensional atoms E and E′ that only differ by an atom A ∉ R, i.e.,

E = F ∪ {A} and E′ = F \ {A},

where F is a set of extensional atoms. Clearly, E ∩ R = E′ ∩ R and, therefore, by definition of uncovered set, we have the thesis.

If part. Assume Γ is indifferent to any atom A ∉ R and consider two sets of extensional atoms E and E′ such that E ∩ R = E′ ∩ R. Then E and E′ differ by a finite number of atoms that do not belong to R. In other words, E′ = E ∪ E^+ \ E^-, where E^+ ∩ E^- = E^+ ∩ R = E^- ∩ R = ∅. But since Γ is indifferent to any single extensional atom A ∉ R, we can add (resp. remove) any single atom in E^+ (resp. E^-) at a time without affecting the evaluation of Γ. Hence the conclusion. □

Corollary 3.4.5 Let Γ be a constraint theory and R1, R2 two uncovered sets for Γ. Then their intersection R1 ∩ R2 is also an uncovered set for Γ.

Proof By lemma 3.4.4, Γ is indifferent to all extensional atoms in B \ R1 as well as to all extensional atoms in B \ R2, and thus to all extensional atoms in their union B \ (R1 ∩ R2). So R1 ∩ R2 is also an uncovered set. □

Now proposition 3.4.2 can be proved as follows.

Proof Suppose that there exist two minimal uncovered sets R1 and R2 such that R1 ≠ R2. But then, by corollary 3.4.5, their intersection R1 ∩ R2 is also an uncovered set and therefore at least one of them is not minimal, against the hypotheses. □

In the ordering based on resource sets, a theory Γ1 precedes another theory Γ2, indicated Γ1 ⪯_r Γ2, whenever R(Γ1) ⊆ R(Γ2). This notion is particularly relevant in the context of concurrent database systems. In chapter 5 we will discuss a schedule construction policy that guarantees conflict serializability as well as correctness. If a simplification procedure for integrity constraints that is ideal in the ⪯_r ordering is available, then the amount of database resources that needs to be locked for integrity checking is minimal. It is possible to find examples of simplification where the optimal constraint theories found according to these three criteria (minimal resource set, minimal number of literals, semantic weakness) differ.

Example 3.4.6 Consider the following constraint theories:

Γ = {← p(X), ← q(a)}, ∆ = {← p(a), ← q(a)}.

Among the theories that are ∆-conditionally equivalent to Γ, the minimal ones with respect to the described orderings are shown in the following table.

⪯_ℓ-minimal          ⪯_w-minimal                       ⪯_r-minimal
{← p(X)}             {← p(X) ∧ X ≠ a ∧ ¬q(a)}          {← p(X) ∧ X ≠ a}

The first column has been calculated by removing ← q(a) from Γ, as it is true in ∆, and the only smaller theory is ∅, which is not ∆-conditionally equivalent to Γ. The second column corresponds to the formula Γ ∨ ¬∆. The last column is derived from ← p(X) (first column), whose resource set is the entire relation p; the resource set can be reduced by removing the tuple p(a) from it, whose truth value is already constrained (to be false) by ∆. We note that the constraint theories indicated in the two last columns of example 3.4.6 are arbitrary, as any of the infinitely many equivalent constraint theories (obtained, e.g., by inserting tautologies in them) would be minimal in the respective ordering. Thus, ⪯_w and ⪯_r are not enumerative, as they do not comply with the fourth requirement of definition 3.1.5.⁸

We stress, however, that, with respect to efficient query evaluation, each criterion can only approximate optimality. For example, a syntactically minimal query does not necessarily evaluate faster than an equivalent non-minimal query in all database states; the amount of computation required to answer a query can be reduced, for instance, by adding a join with a very small relation. Several refinements of the syntactic order ⪯_ℓ described in this section can be considered, such as preferring more specific constraints to more general ones. However, for all such improvements there will be cases in which efficiency is not measured precisely. For example, ← p(X) ∧ q(Y) is likely to be evaluated faster than the more specific ← p(X) ∧ q(X), as the former can be checked by verifying that either p or q is empty, whereas the latter introduces a join that potentially requires that all tuples in p be looked up in q. Even if we limit such a criterion to the preference of ground literals to non-ground ones, we still do not capture the notion of efficiency correctly.

8However, one could of course choose to represent the whole class of equivalent constraint theories with the syntactically minimal one, so that the only possible choices in the table of example 3.4.6 would be those we have indicated.

For example, ← p(X) will typically run faster than ← p(a), as for the former it is sufficient to verify that p is empty, whereas for the latter a lookup in p is needed. However, it can be argued that a syntactic ordering such as ⪯_ℓ captures efficiency in most cases, as will be demonstrated in the experimental tests discussed in chapter 6. Furthermore, the simplification procedure also conforms to the strategy of specializing integrity constraints as much as possible, in that variable/constant equalities are removed by substituting the variable by the constant. So, for example, a denial such as φ = ← X = a ∧ p(X, Y) ∧ q(Y) is not transformed into ← p(X, Y) ∧ q(Y),⁹ which has fewer literals but is arguably less efficient to evaluate than φ, but to ← p(a, Y) ∧ q(Y), which contains fewer literals and is more specialized than φ. In the literature, most methods do not explicitly refer to any particular ordering; some refer to syntactic minimality criteria such as ⪯_ℓ [47, 93, 94]; others [102, 143, 70] use resource sets (sometimes called checking space).

3.4.2 Achieving minimal theories

In the remainder of the thesis, unless otherwise stated, we refer to the ⪯_ℓ ordering, which has a clear enumeration procedure for all constraint theories. However, enumerating theories, although theoretically possible, might be practically unfeasible. We describe here another strategy that tries to progressively eliminate constraints and parts of constraints from a starting, possibly non-optimal simplification, such as the one returned by After. A local minimum for a set of denials is recursively defined below. We recall that, given two denials φ and ψ, φ is a subclause of ψ iff there is a renaming ρ such that each literal in φρ occurs in ψ.

Definition 3.4.7 A theory Σ is locally below a theory Γ (written Σ ≺_loc Γ) whenever

1. Σ = ∅ and Γ ≠ ∅, or

2. Σ = {φ} ⊔ Σ′, Γ = {ψ} ⊔ Γ′, φ is a subclause of ψ and (Σ′ ≺_loc Γ′ or Σ′ = Γ′),

where φ and ψ are denials. Σ is a local (resp. global) minimum of Γ with respect to a constraint theory ∆ if Σ ≡^∆ Γ and there is no other theory Σ′′ ≡^∆ Γ such that Σ′′ ≺_loc Σ (resp. Σ′′ ≺_ℓ Σ).

A general procedure to find a local minimum of Γ with respect to ∆ consists in repeating the following steps as long as possible.

Procedure 3.4.8

1. If for a denial φ ∈ Γ it holds that ∆ ∪ (Γ \ φ) |= φ then φ is removed from Γ.

2. If for a denial φ = ← L1 ∧ · · · ∧ Ln ∈ Γ it holds that ∆ ∪ Γ |= ← L1 ∧ · · · ∧ Li−1 ∧ Li+1 ∧ · · · ∧ Ln = ψ for some i such that 1 ≤ i ≤ n, then φ is replaced by ψ.

⁹Unless a constraint such as ← X ≠ a ∧ p(X, Y) is known to hold, which could then be used by a query optimizer to evaluate ← p(X, Y) ∧ q(Y) as fast as ← p(a, Y) ∧ q(Y).

Each step preserves conditional equivalence and the strategy guarantees that a local minimum is found: when the procedure stops, no denial or literal can be removed from the resulting theory. Furthermore, if it were possible to remove more than one literal or denial at a time, then it would also be possible to remove a single literal or denial at a time.
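Procedure 3.4.8 is easily phrased as a search driven by an entailment test, which is the undecidable part. In the Python skeleton below the test is passed in as an oracle function `entails(theory, phi)`; everything else is plain syntactic bookkeeping over denials represented as lists of literals. This is a sketch of the strategy only: the Optimize_LS procedure of section 3.2 replaces the oracle by the terminating combination of ⊢_R, reduction and subsumption.

```python
# Skeleton of procedure 3.4.8; `entails` is an assumed oracle, not an implementation.

def local_minimum(gamma, delta, entails):
    gamma = [list(phi) for phi in gamma]
    changed = True
    while changed:
        changed = False
        # Step 1: drop a denial entailed by the remaining theory together with delta.
        for i, phi in enumerate(gamma):
            rest = gamma[:i] + gamma[i+1:]
            if entails(rest + delta, phi):
                gamma = rest
                changed = True
                break
        if changed:
            continue
        # Step 2: drop a literal whose removal is still entailed by delta and gamma.
        for i, phi in enumerate(gamma):
            for j in range(len(phi)):
                psi = phi[:j] + phi[j+1:]
                if entails(gamma + delta, psi):
                    gamma[i] = psi
                    changed = True
                    break
            if changed:
                break
    return gamma
```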

Example 3.4.9 Consider the following constraint theories from example 3.2.37.

∆ = { ← ¬p(X) ∧ q(X) ∧ r(X), ← p(X) ∧ ¬q(X), ← p(X) ∧ ¬r(X) },
Γ = { ← s(X) ∧ q(X) ∧ r(X) },
Σ = { ← s(X) ∧ p(X) }.

We have that Σ ≡^∆ Γ, as ∆ is clearly an encoding of the equivalence between p(X) and q(X) ∧ r(X). Both Γ and Σ are local minima of Γ ∪ Σ with respect to ∆ and only Σ is a global minimum.

In practice there is often only one local minimum, which coincides with the global minimum. However, when particular dependencies are encoded in the integrity constraints, such as equivalences between (sets of) predicates, as in example 3.4.9, then there may be several local minima. On the other hand, by definition 3.1.5, the global minimum is unique. The procedure depicted in this section is, however, based on entailment, which is in general undecidable; furthermore, sound and complete proof procedures, based, e.g., on resolution, are not guaranteed to terminate¹⁰. The simplification framework described in section 3.2 implements a practically relevant approximation of this strategy in which entailment is replaced by specialized sound and terminating proof procedures. In particular, the first two rewrite rules in the definition of Optimize_LS (3.2.28 on page 41) implement step 1, whereas the last two implement step 2.

3.5 Ideality and Simp_LS

Before discussing how well SimpLS approximates the described simplification strategy for finding a local minimum, we emphasize the completeness of resolution and subcases thereof.

3.5.1 Completeness of resolution

As shown in proposition 3.2.22 on page 39, resolution is sound; this property was used in the construction of Simp_LS. Resolution in the context of clauses with logical negation is also refutation-complete, in the sense specified below.

¹⁰In principle, entailment can be implemented by any general theorem prover, properly extended with a time-out facility to ensure termination. This may provide a procedure that produces minimal or close-to-minimal theories, but generality and quality would be problematic to characterize.

Theorem 3.5.1 [146] Resolution is refutation-complete, i.e., if Γ is an unsatisfiable set of clauses then there exists a refutation of Γ.

However, we applied resolution in the context of denials with default negation. To see that refutation-completeness also holds in this context, we note that the notion of unsatisfiability of a set of clauses with logical negation coincides with unsatisfiability of a set of denials with default negation. Indeed, a set is unsatisfiable if it admits no model. In the setting of denials with default negation, a set Γ is unsatisfiable if no standard model (for any possible attached database) is a model of Γ, i.e., if Γ admits no model. The latter can thus be checked via standard resolution by syntactically replacing default negation with logical negation. Therefore resolution applied to denials with default negation is also refutation-complete, i.e., if Γ is an unsatisfiable set of denials with default negation then there exists a refutation of Γ.

Thanks to theorem 3.5.1, proofs can be performed by reductio ad absurdum, i.e., to prove φ from Γ, one checks whether there is a refutation of Γ ∪ {¬φ}. Of course, ¬φ has to be transformed into clausal form for the proof to be made by resolution. It is possible to transform any first-order formula F into an equivalent formula G expressed in negation normal form (NNF), i.e., a formula whose only connectives are ∧, ∨, ¬, where ¬ may only occur in front of atoms. An NNF formula can then be transformed into prenex normal form (PNF) by moving all the quantifiers outwards. Finally, from a PNF formula we can eliminate the existential quantifiers by performing the so-called skolemization: if ∃X occurs in G in the scope of universal quantifiers ∀Y1, . . . , ∀Ym, then in sk(G) the ∃X is removed and X is replaced by f(Y1, . . . , Ym), where f is a new m-ary function symbol not occurring in G, called a Skolem function. Note that, if ∃X is not in the scope of universal quantifiers, then f is a nullary function symbol, i.e., a (Skolem) constant. In general sk(G) is not equivalent to G; however, sk(G) is satisfiable iff G is. The transformation from sk(G) to clausal form is trivial, and this means that skolemized formulas can be used in proofs by resolution. A detailed explanation of skolemization and its application to resolution can be found in [82]. With the addition of subsumption, completeness of resolution can be stated directly using deduction (⊢_d) by the following theorem.

Theorem 3.5.2 (Subsumption theorem [138]) Let Γ be a constraint theory and φ a denial. Then Γ |= φ iff Γ ⊢_d φ.

Specialized versions of resolution can be used to carry out proofs more efficiently. For example, a deduction by unit resolution is a deduction in which each resolvent is obtained by using at least one unit clause, i.e., a clause consisting of a single literal. Unit resolution is not (refutation-)complete in general.

Example 3.5.3 The theory {← p(X) ∧ p(Y), ← ¬p(Z) ∧ ¬p(W)} is unsatisfiable, but the empty clause cannot be derived by unit resolution, as there is no unit clause in the set.

However, when all clauses in the set are Horn clauses, unit resolution is refutation-complete [131]. Besides, in the presence of a Horn set, the subsumption theorem holds also if resolution is replaced by unit resolution.

Theorem 3.5.4 Let Γ be a Horn constraint theory and φ a Horn denial. Then Γ |= φ iff there is a unit derivation of a denial ψ from Γ such that ψ subsumes φ.

55 3.5.2 Local minima in SimpLS

As mentioned, the OptimizeLS procedure gives an approximation of the entailment-based procedure 3.4.8 described on page 53. The quality of the results produced by SimpLS depends on how well the described proof procedure implements entailment. It is known that for certain classes of languages, such as the monadic class, Herbrand’s class and the one-variable class [115], sound and complete procedures based on resolution refinements are guaranteed to terminate. In these cases an ideal simplification can be found.

In the described procedure for L_S, we chose to limit the size of the resolvents in the resolution proofs to the number n of literals of the largest denial in the starting theory, but other techniques could have been used, such as a limit on the depth of the proof tree or even a time limit. However, our choice is relevant if resolution proofs are performed in a data-driven fashion; in the deductive closure of the original theory, the denials whose size is less than n are the only usable clauses for elimination of literals by subsumption (and reduction). The principles of subsumption and reduction are, in practice, sufficient for most cases of denial elimination, and resolution proper is only needed when the integrity constraints encode circularity and recursive definitions or extra hypotheses are provided (see examples 3.2.33 and 3.2.37).

We now discuss how and when SimpLS reaches a local minimum. To see this, we first state when a local minimum can be found by procedures based on unit resolution, and then we relate this result to OptimizeLS . There are interesting cases in which entailment can be replaced by a terminating proof procedure. We recall that a clause is Horn if, when expressed as a disjunction of literals, it contains at most one positive literal. Equivalently, we say that a denial is Horn if it has at most one negative literal.

Proposition 3.5.5 Let Γ, ∆ be two sets of Horn denials containing no non-nullary function symbol, no parameters and no equalities. Then:

• Γ is unsatisfiable iff there is a unit derivation of the empty clause from Γ,

• it is decidable whether there is a unit derivation of a given clause from Γ, and

• there is a terminating procedure that produces Γ’s local minimum wrt ∆.

Proof The first claim is trivial because unit resolution is complete for the Horn fragment, and here the theory is Horn. For the second claim consider that in every unit resolution step there is one unit clause and another clause with, say, n literals. The number of literals in the resolvent is n − 1. Since there is no function symbol and the number of clauses is finite, there is only a finite number of possible unit resolution steps. Therefore one can generate all possible unit derivations from Γ in finite time and check whether the given clause has been produced. From the first two claims we can conclude that satisfiability of a set of Horn denials containing no non-nullary function symbol, no parameters and no equalities can be decided by unit resolution, by checking whether there is a unit derivation of the empty clause.

For the last claim consider that a local minimum is found by checking the two entailment properties described in procedure 3.4.8 on page 53. In both cases the entailed formula is a (subclause of a) denial in Γ, i.e., a formula of the form ∀~X (← L1 ∧ · · · ∧ Ln), where

~X are the variables in L1, . . . , Ln. In order to check such entailments by resolution, the target denial needs to be negated, transformed into skolemized clausal form, and added to the initial set. But the skolemized clausal form of the negation of a denial is a formula of the form {← ¬L1σ, . . . , ← ¬Lnσ}, where σ is a grounding substitution that instantiates every different variable in ~X to a different Skolem constant. Therefore, after the addition of the negated denial, the set is still Horn and function-free and therefore its satisfiability can be decided by unit resolution, as was concluded above. □
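The first claim of proposition 3.5.5 can be illustrated on ground (propositional) Horn clauses, where unit resolution boils down to a plain saturation loop. The Python sketch below is such a propositional illustration only—the clause encoding and the loop are assumptions of the sketch, not the procedure used in the thesis—and it returns True exactly when the empty clause is derivable by unit resolution.

```python
# Propositional illustration of deciding unsatisfiability by unit resolution.
# Clause encoding: 'p' is a positive literal, ('not', 'p') a negative one,
# and a clause is a set of literals.

def negate(lit):
    return lit[1] if isinstance(lit, tuple) else ('not', lit)

def unit_refutable(clauses):
    """True iff the empty clause is derivable by unit resolution."""
    clauses = set(frozenset(c) for c in clauses)
    while True:
        units = [c for c in clauses if len(c) == 1]
        new = set()
        for u in units:
            (lit,) = u
            for c in clauses:
                if negate(lit) in c:
                    resolvent = c - {negate(lit)}
                    if not resolvent:
                        return True               # empty clause derived: refutation
                    new.add(frozenset(resolvent))
        if new <= clauses:
            return False                          # saturated without the empty clause
        clauses |= new

# { p }, { not p or q }, { not q } is an unsatisfiable Horn set; dropping the
# last clause makes it satisfiable.
print(unit_refutable([{'p'}, {('not', 'p'), 'q'}, {('not', 'q')}]))   # True
print(unit_refutable([{'p'}, {('not', 'p'), 'q'}]))                   # False
```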

We note that this result is not in contrast with the fact that query containment is undecidable for datalog, since the Horn constraint theory considered in proposition 3.5.5 is part of a schema with an empty IDB, whereas a datalog program may have a recursive (and thus not eliminable) IDB. So, this result is actually in accordance with the decidability of query containment for non-recursive datalog without negation [3].

Proposition 3.5.6 The Optimize_LS procedure always returns a local minimum when the input theories are Horn, parameter-free and with no equalities.

Proof As shown in theorem 3.5.4, for Horn theories, the subsumption theorem holds when unit resolution is applied instead of resolution. In such a case, unit deduction and

⊢_R (which is used in the Optimize_LS procedure) behave in the same way. Therefore the result of proposition 3.5.5 also applies to Optimize_LS. □

The result of proposition 3.5.5 may be extended to Horn theories with equalities and parameters provided that proper equality axioms (reflexivity, symmetry, transitivity and substitutivity), such as the free equality axioms shown on page 14, are added to the input set. Alternatively, resolution can be extended with paramodulation and the reflexivity axiom [110] in order to be refutation-complete. The paramodulation rule is an inference rule that handles equalities in the presence of function symbols and allows replacing terms in equations. In a function-free setting, however, this generality is not needed, since reduction takes care of reflexivity and expansion provides substitutivity, whereas symmetry was assumed to be an implicit syntactic property of equality. However, the transitivity axiom (X = Y ← X = Z ∧ Z = Y) would need to be added to the input set to obtain refutation-completeness and, thus, further refine the simplification procedure.

Example 3.5.7 Consider the constraint theory Γ = {← p(X) ∧ X ≠ a, ← p(X) ∧ X ≠ b}. It is clearly equivalent to {← p(X)} since (X ≠ a ∨ X ≠ b) is a consequence of the equality axioms. The first denial resolves with the transitivity axiom into X = Y ← p(X) ∧ a = Y, which resolves with the second denial into X = b ← p(X) ∧ p(a), a factor of which is a = b ← p(a), which reduces to ← p(a) and expands to ← p(X) ∧ X = a. A resolution step with the first denial produces ← p(X), which subsumes all the denials and is thus the result.

Without the inclusion of the transitivity axiom, there are cases of redundancy, occurring when there are cyclic parameter dependencies, that are not detected by Optimize_LS. For example, consider the denial φ = ← a ≠ b ∧ b = c ∧ c = d ∧ d = a.

With a resolution step of the transitivity axiom with itself, one gets ← X ≠ Y ∧ X = T ∧ T = Z ∧ Z = Y, which subsumes φ; φ can therefore be eliminated. Without the transitivity axiom, one would not be able to infer this.

3.5.3 Complexity

We cannot hope to obtain an ideal procedure in all non-recursive cases, as QC is already undecidable for non-recursive datalog with negation [3]. Although QC is decidable for non-recursive datalog programs without negation, it is known to be decidable in exponential time (see [38] for an overview of this and other results on QC). The search for a local minimum in these cases, as sketched in procedure 3.4.8 on page 53, is thus also exponential, since it may require solving n + m QC problems, where n is the number of literals and m is the number of denials in the constraint theory. These results would suggest that the problem is intractable; however, the complexity is here measured with respect to the size of the query and not of the data in the database. Another aspect to be considered is that simplification is a static process, therefore it is worthwhile to invest resources in compiling the constraints at design time so as to improve run time efficiency.

Finally, we point out that subsumption, which is used in the simplification procedure, is an incomplete procedure for entailment. However, it becomes complete (i.e., C subsumes D iff C |= D) if C is neither recursive nor tautological [90]. Subsumption is already NP-complete in general [108]. This complexity is due to the ambiguity of variable identification and amounts to O(|vars(D)|^|vars(C)|) to check whether C subsumes D. In practice, if the domain is weakly structured, i.e., there is only a small number of predicates occurring very often in the clauses, only short clauses are affordable; conversely, for strongly structured domains, complexity decreases dramatically [152]. This observation proves very important for data representation issues. For example, in section 5.3, this will indicate what relational representation of the nested data structure should be chosen in order to keep the application of the simplification operators tractable.

3.6 Related work

Simplification of integrity constraints is highly relevant for optimizations in database integrity checking. Typically it gives a speed-up of a linear factor (in the size of the database state) for singleton updates, but for certain transactions an even higher speed-up can be gained. We find the ability to check consistency of a possibly updated database before execution of the transaction under consideration crucial, as inconsistent states are thereby completely avoided. As mentioned, pre-testing the feasibility of an update with respect to the integrity constraints is particularly valuable, as it allows one to avoid both the execution of the update and, especially, the restoration of the database state that held before the update, which is typically done by means of expensive rollback or repair operations. Rollbacks usually require costly bookkeeping in order to restore the old state; as for repairs, repaired states need to be calculated according to given preference criteria and then repairing actions need to be executed. Several approaches to simplification first require the transaction to be performed, and then the resulting state to be checked for consistency [137, 124, 147, 71, 93, 116].

Methods that, like ours, are based on pre-tests are, e.g., [143, 100, 102, 114], including a few industrial attempts, e.g., [21, 41]. Simplification of integrity constraints with respect to given parametric update patterns resembles the notion of program specialization used in partial evaluation [106], which is the process of creating a specialized version of a given program with respect to known input data. Applications of these techniques to integrity checking have been investigated in [116], where partial evaluation of a meta-interpreter is used to produce logic programs that correspond to simplified constraints. A partial evaluator is given a meta-interpreter that constitutes a general integrity checker and produces as output a version of the meta-interpreter specialized to specific update patterns to be checked in the updated state (and employing the hypothesis that integrity holds before the update). Generally, it is difficult to evaluate such a method as it depends on a number of heuristics in the partial evaluator as well as in the meta-interpreter.

The proposal of [100] presents several analogies with our method, although it does not apply to deductive databases, but only to relational databases without views. A series of tests, expressed in a datalog-like language, is generated from an integrity constraint C and an update U; if one of these tests succeeds, then U is legal with respect to C. This method is based on resolution and transition axioms. The update language is limited to single additions, deletions and changes, i.e., no transactions can be specified. Updates can contain so-called dummy constants, which correspond to our notion of parameters. A disadvantage of this approach is that, once the set of tests is generated, strategies have to be employed to decide which specific tests to execute and in which order, i.e., redundancies are kept within the generated tests. The authors do not develop this aspect further. Furthermore, and more importantly, failure of all tests does not necessarily imply inconsistency: this means that, in such a case, a necessary and sufficient condition must be used. Instead, our simplification algorithm can handle more general updates and generates tests that are a necessary and sufficient condition for determining consistency of the database if the update were performed.¹¹

In [47, 93], the authors introduce a principle called partial subsumption applied, among other things, to produce simplified integrity constraints. Their method can handle singleton additions or deletions and produces conditions to be applied after the update. Some parametric updates can be expressed using variables. However, the described approach does not consider several occurrences of the same parameter in the same relation or in different relations within the same update. For compound updates (transactions) the principle is explained in terms of examples, but no general procedure is described. Changes are modeled as deletions plus insertions; however, this is not completely satisfactory, as all deleted and inserted tuples must be known in advance (unlike, e.g., update U2 of example 2.2.22 on page 17) and for any insertion (resp. deletion) of a tuple, it must be assumed that the tuple was not present (resp. was present) in the database. Partial subsumption applies also to semantic query optimization; see [94, 88] for an overview.
Qian [143] observes the relationship between Hoare’s logic [101, 75] for imperative languages and integrity checking, identifying a simplified integrity constraint as a weakest precondition for having a consistent updated state. This notion is enforced by assuming consistency of the database before the update. Qian’s method works for a variety of

11If their tests were a necessary and sufficient condition, they would logically correspond to the negation of our tests.

59 SQL-like ways of updating relations but with the impractical limitation that it does not allow more than one update action in a transaction to operate on the same relation; furthermore, no mechanism corresponding to parameters is present, thus requiring to execute the procedure for each update. We have no such restrictions. Integrity checking is often regarded as an instance of materialized view maintenance: integrity constraints are defined as views that must always remain empty for the database to be consistent. The database literature is rich in methods that deal with relational view/integrity maintenance; the book [99] and the survey [76] provide insightful discussion on the subject. Abduction in logic programming (e.g., [107]), which concerns the generation of hy- potheses necessary to explain given observations, typically involves integrity constraints on the possible hypotheses to be generated, and a classical example of abduction is that of database updates through view updates. Most abductive methods perform a complete check of the integrity constraints when a new hypothesis (i.e., a tuple update) is proposed by the inference engine. By the use of constraint logic programming techniques (such as Constraint Handling Rules [85]) to implement abduction, incremental evaluation of in- tegrity constraints may arise without an explicit simplification algorithm: each time an abducible atomic update a arises, the current representation of the integrity constraints wakes up, checks a’s dependencies and, in case of success, delays a specialized version of the integrity constraints waiting for the next update. This principle is applied in the DemoII system [54, 57] and in the approaches of [1, 56] using Constraint Handling Rules for abduction; [55] is an attempt to relate such methods to database applications. How- ever, a common drawback of these techniques is that the delayed constraints typically unroll to a size proportional to the database and that, occasionally, an unsatisfiable set of constraints is delayed where a failure should be reported. As we noticed, ideal simplification is possible if and only if a query containment decision procedure exists for the class of databases under consideration. Although even non-ideal simplification procedures prove very useful in practice, progress in the field of query containment can inspire the construction of more refined simplification algorithms; we refer to [84, 37, 38, 27] and the references therein for an overview of decidable cases and decision algorithms that have been studied in the literature.

Chapter 4

Extensions

In this chapter we discuss three orthogonal extensions of the language LS and the simplification procedure SimpLS.

In the definition of LS we limited the level of interaction between negation and existential quantification in the constraint theories. In section 4.1, we relax this limitation and extend the syntax of denials so as to allow the presence of negated existential quantifiers. The simplification procedure can be adapted to such cases, provided that its components are adjusted so as to handle conjuncts starting with a negated existential quantifier. In section 4.2, we describe an extension of the syntax of denials that includes arithmetic operators and aggregates. This requires extending the components of the simplification procedure with new rewrite rules that possibly interact with a constraint solver for arithmetic in order to properly handle such constructs.

Finally, in section 4.3, we consider an extension of LS that allows the presence of recursive predicates. With the precaution that such predicates cannot be unfolded, SimpLS can be used for such cases. However, for certain recursive patterns, it is possible to refine the simplification procedure so as to obtain a much higher degree of optimization than was available with SimpLS .

4.1 Nested denials

In this section we address the problem of simplification of integrity constraints in hier- archical databases. As mentioned in section 2.2, all clauses are range restricted and the schemata of hierarchical databases are assumed to be non-recursive, but there is no other restriction on the occurrence of negation. We refer to the language of such schemata as

LH.

Definition 4.1.1 (LH) Let S be a schema. S is in LH if its starred dependency graph is acyclic. Unfolding predicates in integrity constraints with respect to their definitions cannot be done in the same way as UnfoldLS . For this purpose, we extend the syntax of denials so as to allow negated existential quantifiers to occur in literals.

Definition 4.1.2 (Extended denials) A negated existential expression or NEE is an expression of the form ¬∃X~ B, where B is called the body of the NEE, X~ are some (possibly all) of the variables occurring in B and B has the form L1 ∧ · · · ∧ Ln, where each Li is a general literal. A general literal is either a literal or a NEE. A formula of the form ∀X~ (← B), where B is the body of a NEE and X~ are some (possibly all) of the free variables in B, is called an extended denial. When there is no ambiguity on the variables in X~ , extended denials are simply written ← B.

Example 4.1.3 The formula ← parent(X) ∧ ¬∃Y child of(X,Y ) is an extended denial. It reads as follows: there is inconsistency if there is a parent X that does not have a child. Note that this is different from the (non-range restricted) denial ← parent(X) ∧ ¬child of(X,Y ), which states that if X is a parent then all individuals must be his/her children.
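To make the nesting structure of extended denials concrete, the following sketch (an illustration of ours, not part of the procedures defined in this thesis; all class and field names are hypothetical) shows one possible representation of general literals, NEEs and extended denials as data structures, with the denial of example 4.1.3 built explicitly.

from dataclasses import dataclass, field
from typing import List, Tuple, Union

# A literal: predicate name, argument terms, and a polarity flag.
@dataclass
class Literal:
    pred: str
    args: Tuple[str, ...]
    positive: bool = True

# A negated existential expression (NEE): ¬∃X~ (L1 ∧ ... ∧ Ln),
# where each conjunct is either a literal or another NEE.
@dataclass
class NEE:
    exvars: Tuple[str, ...]                      # the quantified variables X~
    body: List[Union[Literal, "NEE"]] = field(default_factory=list)

# An extended denial ← B, where B is a conjunction of general literals.
@dataclass
class ExtendedDenial:
    body: List[Union[Literal, NEE]] = field(default_factory=list)

# The extended denial of example 4.1.3:  ← parent(X) ∧ ¬∃Y child_of(X, Y)
example = ExtendedDenial(body=[
    Literal("parent", ("X",)),
    NEE(exvars=("Y",), body=[Literal("child_of", ("X", "Y"))]),
])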

We observe that variables under a negated existential quantifier conform with the intuition behind safeness, so we could conclude that the first formula in example 4.1.3 is safe, whereas the second one is not. We can now apply unfolding in LH to obtain extended denials. In doing so, attention needs to be paid when replacing negated intensional predicates by their definition, since they may contain non-distinguished variables and thus existential quantifiers have to be explicitly indicated. As was the case for UnfoldLS , the replacements may result in disjunctions and negated conjunctions. Therefore, additional steps are needed to restore the extended denial form.

Definition 4.1.4 Let S = hIDB, Γi be a database schema in LH. We define UnfoldLH (S) as the set of extended denials obtained by iterating the two following steps as long as possible:

1. replace, in Γ, each occurrence of a literal of the form ¬p(t~) by ¬∃Y~ Fp{X~/t~} and of a literal of the form p(t~) by Fp{X~/t~}, where p ∈ pred(IDB), Fp is p's defining formula, X~ its distinguished variables and Y~ its non-distinguished variables. If no replacement was made, then stop;

2. transform the resulting formula into a set of extended denials according to the fol- lowing patterns; Φ(Arg) is an expression indicating the body of a NEE in which Arg occurs; X~ and Y~ are disjoint sequences of variables:

• ← A ∧ (B ∨ C) is replaced by ← A ∧ B and ← A ∧ C;
• ← A ∧ ¬(B ∨ C) is replaced by ← A ∧ ¬B ∧ ¬C;
• ← A ∧ ¬(B ∧ C) is replaced by ← A ∧ ¬B and ← A ∧ ¬C;
• ← A ∧ ¬∃X~ Φ(¬∃Y~ [B ∧ (C ∨ D)]) is replaced by ← A ∧ ¬∃X~ Φ(¬∃Y~ [B ∧ C] ∧ ¬∃Y~ [B ∧ D]).

Without loss of generality, we can assume that, for any NEE N = ¬∃XB~ occurring in an extended denial φ, the variables X~ do not occur outside N in φ. This can simply be obtained by renaming the variables appropriately and we refer to such an extended

denial as standardized. The level of a NEE in an extended denial is the number of NEEs that contain it, plus 1. The level of a variable X in a standardized extended denial is the level of the NEE starting with ¬∃X~ , where X is one of the variables in X~ , or 0 if there is no such NEE. The level of an extended denial is the maximum level of its NEEs, or 0 if there is no NEE. In example 4.1.3, the extended denial has level 1, X has level 0 and Y has level 1. With a slight abuse of notation, we write in the following S ≡ Ψ (or Ψ ≡ S), where S = ⟨IDB, IC⟩ is a schema and Ψ is a set of extended denials, to indicate that, for every database D based on IDB, D |= IC iff D |= Ψ. We can now claim the correctness of

UnfoldLH . We state the following proposition without a proof, since all steps in definition 4.1.4 are trivially equivalence-preserving.

Proposition 4.1.5 Let S ∈ LH. Then UnfoldLH (S) ≡ S. Since the variables under a negated existential quantifier conform with the intuition be- hind safeness, the unfolding of a schema in which all the clauses are range restricted yields a set of extended denials that still are safe in this sense. We also note that the language of extended denials is very expressive. In [123], it was shown that any closed formula of the form ∀X~ (← B), where B is a first-order formula and X~ are its open variables, can be equivalently expressed by a set of Prolog rules plus the denial ← q(X~ ), where q is a fresh predicate symbol. The construction is such that, if B is function-free, then the resulting rule set is also function-free (no skolemization is needed), i.e., it is that of a schema in LH. The unfolding of the obtained schema is therefore an equivalent set of extended denials. A simplification procedure can now be constructed for extended denials in a way similar to what was done in LS.

Definition 4.1.6 Let S be a schema in LH and U an update. We define AfterLH^U(S) as UnfoldLH(After^U(S)). The optimization step needs to take into account the nesting of NEEs in extended denials.

Besides the elimination of disjunctions within NEEs, which is performed by UnfoldLH , we can also eliminate, from a NEE, equalities and non-equalities referring to variables of lower level with respect to the NEE.

Definition 4.1.7 Let A, B, C be (possibly empty) conjunctions of general literals, Y~ , Z~ disjoint (sequences of) variables, W a variable of level lower than the level of Z~, and Φ(Arg) an expression indicating a NEE in which Arg occurs. The following rewrite rules are, respectively, the equality elimination and non-equality elimination rules.

← A ∧ Φ(¬∃Y~ [B ∧ ¬∃Z~(C ∧ W = c)]) ⇒ ← A ∧ Φ(¬∃Y~ [B ∧ W = c ∧ ¬∃Z~(C)] ∧ ¬∃Y~ [B ∧ W ≠ c])

← A ∧ Φ(¬∃Y~ [B ∧ ¬∃Z~(C ∧ W ≠ c)]) ⇒ ← A ∧ Φ(¬∃Y~ [B ∧ W ≠ c ∧ ¬∃Z~(C)] ∧ ¬∃Y~ [B ∧ W = c])

The above (non-)equality elimination rewrite rules are equivalence preserving, as stated below.

Proposition 4.1.8 Let ψ be an extended denial and ψ′ (resp. ψ″) be the extended denial obtained after an application of the equality (resp. non-equality) elimination rule. Then ψ ≡ ψ′ and ψ ≡ ψ″.

Proof Using the notation of definition 4.1.7, we have:

¬∃Y~ [B ∧ ¬∃Z~(C ∧ W = c)]
≡ ¬∃Y~ [B ∧ (W = c ∨ W ≠ c) ∧ ¬∃Z~(C ∧ W = c)]
≡ ¬∃Y~ [B ∧ W = c ∧ ¬∃Z~(C ∧ W = c)] ∧ ¬∃Y~ [B ∧ W ≠ c ∧ ¬∃Z~(C ∧ W = c)]
≡ ¬∃Y~ [B ∧ W = c ∧ ¬∃Z~(C)] ∧ ¬∃Y~ [B ∧ W ≠ c]

In the first step, we added the tautological conjunct (W = c ∨ W ≠ c). In the second step we used de Morgan's laws in order to eliminate the disjunction (as in the definition of UnfoldLH). The formula W = c ∧ ¬∃Z~(C ∧ W = c) can be rewritten as W = c ∧ ¬(∃Z~ C ∧ W = c), and then as W = c ∧ (¬∃Z~(C) ∨ W ≠ c), which, with a resolution step, results in the first NEE in the last extended denial. Similarly, W ≠ c ∧ ¬∃Z~(C ∧ W = c) can be rewritten as W ≠ c ∧ (¬∃Z~(C) ∨ W ≠ c), which results (by absorption) in the second NEE in the last extended denial. The proof is similar for non-equality elimination. 2

In case only levels 0 and 1 are involved, the rules look simpler. For example, equal- ity elimination can be conveniently formulated as follows.

← B ∧ ¬∃X~ [C ∧ W = c]  ⇒  { ← B{W/c} ∧ ¬∃X~ C{W/c},  ← B ∧ W ≠ c }.        (4.1)

Although (non-)equality elimination does not necessarily shorten the input formula (in fact, it can also lengthen it), it always reduces the number of literals in higher-level NEEs. Therefore, termination can still be guaranteed if this rewrite rule is applied during optimization. Repeated application of such rules “pushes” the involved (non-)equalities outwards until they reach a NEE whose level is the same as the level of the variable in the (non-)equality. Then, in case of an equality, the usual equality elimination step of reduction can be applied.
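As a concrete illustration of rule (4.1), the following sketch (ours; the string-based representation and all function names are hypothetical, not the thesis' machinery) applies the level-0/1 equality elimination to a denial given as lists of literal strings.

import re

def subst(literals, var, const):
    """Replace every whole-word occurrence of variable `var` by `const`."""
    pat = re.compile(rf"\b{re.escape(var)}\b")
    return [pat.sub(const, lit) for lit in literals]

def eliminate_equality_level1(B, exvars, C, var, const):
    """Rewrite  ← B ∧ ¬∃exvars (C ∧ var = const)  into the two denials of rule (4.1):
         ← B{var/const} ∧ ¬∃exvars C{var/const}
         ← B ∧ var ≠ const
    and return them in a readable string form."""
    d1 = " ∧ ".join(subst(B, var, const) +
                    [f"¬∃{','.join(exvars)} [{' ∧ '.join(subst(C, var, const))}]"])
    d2 = " ∧ ".join(B + [f"{var} ≠ {const}"])
    return [f"← {d1}", f"← {d2}"]

# For instance, for the denial  ← p(X) ∧ ¬∃Y [q(X,Y) ∧ X = a]:
print(eliminate_equality_level1(["p(X)"], ["Y"], ["q(X,Y)"], "X", "a"))
# ['← p(a) ∧ ¬∃Y [q(a,Y)]', '← p(X) ∧ X ≠ a']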

Example 4.1.9 The following rewrites show the propagation of a variable of level 0 (X) from level 2 to level 0 via two equality eliminations and one non-equality elimination.

← p(X) ∧ ¬∃Y {q(X,Y ) ∧ ¬∃Z[r(X,Y,Z) ∧ X = a]}

≡ ← p(X) ∧ ¬∃Y {q(X,Y ) ∧ X = a ∧ ¬∃Z[r(X,Y,Z)]} ∧ ¬∃Y [q(X,Y ) ∧ X ≠ a]

≡ { ← p(a) ∧ ¬∃Y [q(a, Y ) ∧ ¬∃Zr(a,Y,Z)] ∧ ¬∃Y [q(a, Y ) ∧ a ≠ a], ← p(X) ∧ X ≠ a ∧ ¬∃Y [X ≠ a ∧ q(X,Y )] }

≡ { ← p(a) ∧ ¬∃Y [q(a, Y ) ∧ ¬∃Zr(a,Y,Z)] ∧ ¬∃Y [q(a, Y ) ∧ a ≠ a], ← p(X) ∧ X ≠ a ∧ ¬∃Y q(X,Y ), ← p(a) ∧ X ≠ a ∧ X = a }

≡ { ← p(a) ∧ ¬∃Y [q(a, Y ) ∧ ¬∃Zr(a,Y,Z)], ← p(X) ∧ X ≠ a ∧ ¬∃Y q(X,Y ) }

In the first step we applied equality elimination to X = a at level 2. In the second step we applied the rewrite rule (4.1) for equality elimination to X = a at level 1. Then we applied non-equality elimination to X ≠ a at level 1 in the second denial. In the last step, we removed, by standard application of reduction, the last extended denial and the last NEE in the first extended denial, which are clearly tautological.

We observe that the body of a NEE is structurally similar to the body of an extended denial. The only difference is that, in the former, there are variables that are quantified at a lower level. According to this observation, such (free) variables in the body of a NEE are to be treated as parameters during the different optimization steps, since, as was indicated on page 9, parameters are free variables. Reduction (definition 3.2.19 on page 37) can then take place in NEE bodies exactly as in ordinary denials, with the proviso above of treating free variables as parameters. The definition of resolution (3.2.21 on page 39 and following definitions) can be adapted for extended denials by applying it to general literals instead of literals. Subsumption (definition 3.2.17 on page 37) can also be applied to extended denials without changing the definition. However, we can slightly modify the notion of subsumption to explore the different levels of NEEs in an extended denial. This is captured by the following definition.

Definition 4.1.10 Let φ = ← A ∧ B and ψ = ← C ∧ D be two extended denials, where A and C are (possibly empty) conjunctions of literals and B and D are (possibly empty) conjunctions of NEEs. Then φ extended-subsumes ψ if both conditions (1) and (2) below hold.

(1) ← A subsumes ← C with substitution σ.

(2) For every NEE ¬∃X~ N N in B, there is a NEE ¬∃X~ M M in D such that ← Nσ is extended-subsumed by ← M. Example 4.1.11 The extended denial ← p(X) ∧ ¬∃Y,Z[q(X,Y ) ∧ r(Y,Z)] extended- subsumes the extended denial ← p(a)∧¬∃T [q(a, T )]∧¬∃W [s(T )], since ← p(X) subsumes ← p(a) with substitution {X/a} and, in turn, ← q(a, T ) subsumes ← q(a, Y ) ∧ r(Y,Z).
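The recursion implicit in definition 4.1.10 can be made explicit with a small sketch (ours, with hypothetical names; it deliberately ignores the parameter/variable distinction and the standardization details, and uses a naive backtracking test in place of a realistic ordinary-subsumption procedure). A denial is represented as a pair (literals, NEEs), each NEE as (exvars, denial), each literal as (predicate, args), and variables are strings starting with an upper-case letter.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def match_literal(lit1, lit2, sub):
    """Try to extend `sub` so that lit1·sub equals lit2 (one-way matching)."""
    (p1, a1), (p2, a2) = lit1, lit2
    if p1 != p2 or len(a1) != len(a2):
        return None
    sub = dict(sub)
    for t1, t2 in zip(a1, a2):
        if is_var(t1):
            if t1 in sub:
                if sub[t1] != t2:
                    return None
            else:
                sub[t1] = t2
        elif t1 != t2:
            return None
    return sub

def subsumes(lits1, lits2, sub=None):
    """Ordinary subsumption: yield each σ mapping every literal of lits1 onto some literal of lits2."""
    sub = sub or {}
    if not lits1:
        yield sub
        return
    head, rest = lits1[0], lits1[1:]
    for lit in lits2:
        s = match_literal(head, lit, sub)
        if s is not None:
            yield from subsumes(rest, lits2, s)

def apply_sub(lits, sub):
    return [(p, tuple(sub.get(t, t) for t in a)) for p, a in lits]

def apply_sub_denial(denial, sub):
    lits, nees = denial
    return (apply_sub(lits, sub), [(xs, apply_sub_denial(b, sub)) for xs, b in nees])

def ext_subsumes(phi, psi):
    """phi = (A, B), psi = (C, D); follows conditions (1) and (2) of definition 4.1.10."""
    A, B = phi
    C, D = psi
    for sigma in subsumes(A, C):
        if all(any(ext_subsumes(M, apply_sub_denial(N, sigma)) for _, M in D)
               for _, N in B):
            return True
    return False

# Example 4.1.11:
phi = ([("p", ("X",))],
       [(("Y", "Z"), ([("q", ("X", "Y")), ("r", ("Y", "Z"))], []))])
psi = ([("p", ("a",))],
       [(("T",), ([("q", ("a", "T"))], [])),
        (("W",), ([("s", ("T",))], []))])
print(ext_subsumes(phi, psi))    # True, as claimed in example 4.1.11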

This definition encompasses ordinary subsumption, in that it coincides with it if B and D are empty. Furthermore, it captures the desired property that if φ extended-subsumes ψ, then φ |= ψ; the reverse, as in subsumption, does not necessarily hold.

Proposition 4.1.12 Let φ and ψ be extended denials. If φ extended-subsumes ψ then φ |= ψ.

Proof Let φ, ψ, A, B, C, D be as in definition 4.1.10. If B and D are empty the claim holds, since φ and ψ are ordinary denials. The claim also holds if D is not empty, since ← C entails ← C ∧ D. We now show the general claim with an inductive proof on the level of extended denials. The base case (level 0) is already proven. Inductive step. Suppose now that φ is of level n + 1 and that the claim holds for extended denials of level n or less. Assume as a first case that B is empty. Then φ entails ψ, since φ entails ← C (← A subsumes ← C by hypothesis). Assume for the moment that B = ¬∃X~N N is a NEE of level 1 in φ and that D contains a NEE (of level 1) ¬∃X~M M, such that φ′ = ← Nσ is extended-subsumed by ψ′ = ← M, as assumed in the hypotheses. But φ′ and ψ′ are extended denials of level n and, therefore, if ψ′ subsumes φ′ then ψ′ entails φ′ by inductive hypothesis. Clearly, since ← A entails ← C ∧ D (by hypothesis) and ← M entails ← Nσ (as a consequence of the inductive hypothesis), then ← A ∧ B entails ← C ∧ D, which is our claim. If B contains more than one NEE of level 1, the argument is iterated by adding one NEE at a time. 2

The inductive proof also shows how to check extended subsumption with a finite number of subsumption tests. This implies that extended subsumption is decidable, since subsumption is. Now that a correct extended subsumption is introduced, it can be used instead of subsumption in the subsumption factoring rule of reduction (definition 3.2.19 on page 37). In the following, when referring to a NEE N = ¬∃X~ B we can also write it as a denial ← B, with the understanding that the free variables in N are considered parameters.

Definition 4.1.13 For an extended denial φ, the reduction φ− of φ is the result of applying on φ equality and non-equality elimination as long as possible, and then the rules of definition 3.2.19 (reduction on page 37) on φ and its NEEs as long as possible, where “literal” is replaced by “general literal”, “subsumes” by “extended-subsumes” and “denial” by “extended denial”.

Without reintroducing similar definitions, we assume that the same word replacements are made for the notion of ⊢R (definition 3.2.26 on page 40). The underlying notions of substitution and unification also apply to extended denials and general literals; only, after substitution with a constant or parameter, the existential quantifier of a variable is removed. For example, the extended denials φ = ← p(X, b) ∧ ¬∃Z[q(Z, X)] and ψ = ← p(a, Y ) ∧ ¬q(c, a) unify with substitution {X/a, Y/b, Z/c}. By virtue of the similarity between denial bodies and NEE bodies, we extend the notion of optimization as follows.

Definition 4.1.14 Given two sets of extended denials ∆ and Γ, OptimizeLH^∆(Γ) is the result of applying the following rewrite rules and the rules of definition 3.2.28 (OptimizeLS on page 41) on Γ as long as possible. In the following, φ and ψ are NEEs, Γ′ is a set of

extended denials, and Φ(Arg) is an expression indicating the body of an extended denial in which Arg occurs.

{← Φ(φ)} ⊔ Γ′ ⇒ {← Φ(true)} ∪ Γ′    if φ− = true
{← Φ(φ)} ⊔ Γ′ ⇒ {← Φ(true)} ∪ Γ′    if (Γ′ ∪ ∆) ⊢R φ
{← Φ(φ)} ⊔ Γ′ ⇒ {← Φ(φ−)} ∪ Γ′     if φ ≠ φ− ≠ true
{← Φ(φ)} ⊔ Γ′ ⇒ {← Φ(ψ−)} ∪ Γ′     if (({φ} ∪ {← Φ(φ)}) ⊔ Γ′ ∪ ∆) ⊢R ψ and ψ− strictly extended-subsumes φ

Finally, the simplification procedure for LH is composed in terms of AfterLH and OptimizeLH.

Definition 4.1.15 Consider a schema S = ⟨IDB, Γ⟩ ∈ LH and an update U. Let Γ′ = UnfoldLH(S). We define

SimpLH^U(S) = OptimizeLH^Γ′(AfterLH^U(S)).

Similarly to LS, soundness of the optimization steps and the fact that After returns a WP entail the following.

Proposition 4.1.16 Let S ∈ LH and U be an update. Then SimpLH^U(S) is a CWP of S with respect to U.

4.1.1 Examples

We now discuss the most complex non-recursive examples that we found in the literature referenced in this thesis.

Example 4.1.17 This example is taken from [114]. Consider a schema S = ⟨IDB, Γ⟩ with three extensional predicates a, b, c, two intensional predicates p, q, a constraint theory Γ and a set of trusted hypotheses ∆.

IDB = { p(X,Y ) ← a(X,Z) ∧ b(Z,Y ), q(X,Y ) ← p(X,Z) ∧ c(Z,Y ) } Γ = { ← p(X, X) ∧ ¬q(1, X) } ∆ = { ← a(1, 5) }

This schema S is not in LS and the unfolding of S is as follows.

UnfoldLH (S) = {← a(X,Y ) ∧ b(Y, X) ∧ ¬∃W, Z(a(1, W ) ∧ b(W, Z) ∧ c(Z, X))}.

We want to verify that the update U = {b(X,Y ) ⇐ b(X,Y ) ∧ X ≠ 5} (the deletion of all b-tuples whose first argument is 5) does not affect consistency. AfterLH^U(S) results in the following extended denial:

← a(X,Y ) ∧ b(Y, X) ∧ Y ≠ 5 ∧ ¬∃W, Z(a(1, W ) ∧ b(W, Z) ∧ W ≠ 5 ∧ c(Z, X)).

As previously described, during the optimization process, the last conjunct can be processed as a separate denial φ = ← a(1, W ) ∧ b(W, Z) ∧ W ≠ 5 ∧ c(Z, X), where X is a free variable

that can be treated as a parameter (and thus indicated in bold). With a resolution step with ∆, the literal W ≠ 5 is proved to be redundant and can thus be removed from φ. The obtained formula is then subsumed by UnfoldLH(S) and therefore OptimizeLH^∆(Simp^U(S)) = ∅, i.e., the update cannot violate the integrity constraint, which is the same result that was found in [114].

Example 4.1.18 The following schema S is the relevant part of an example described in [116] on page 24.

S = h{ married to(X,Y ) ← parent(X,Z) ∧ parent(Y,Z)∧ man(X) ∧ woman(Y ), married man(X) ← married to(X,Y ), married woman(X) ← married to(Y, X), unmarried(X) ← man(X) ∧ ¬married man(X), unmarried(X) ← woman(X) ∧ ¬married woman(X) }, { ← man(X) ∧ woman(X), ← parent(X,Y ) ∧ unmarried(X) }i

If we reformulate the example using the shorthand notation introduced on page 42, the database is updated with U = {man(a)}, where a is a parameter. The unfolding given by

UnfoldLH (S) is as follows, where m, w, p respectively abbreviate man, woman, parent, which are the only extensional predicates.

{ ← m(X) ∧ w(X), ← p(X,Y ) ∧ m(X) ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ m(X) ∧ w(T )], ← p(X,Y ) ∧ w(X) ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ w(X) ∧ m(T )] }

We start the simplification process by applying AfterLH to S wrt U.

AfterLH^U(S) ≡ { ← (m(X) ∨ X = a) ∧ w(X),
← p(X,Y ) ∧ (m(X) ∨ X = a) ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ (m(X) ∨ X = a) ∧ w(T )],
← p(X,Y ) ∧ w(X) ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ w(X) ∧ (m(T ) ∨ T = a)] }

After eliminating the disjunctions at level 0, AfterLH^U(S) is as follows:

{← m(X) ∧ w(X), ← X = a ∧ w(X), ← p(X,Y ) ∧ m(X) ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ (m(X) ∨ X = a) ∧ w(T )], ← p(X,Y ) ∧ X = a ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ (m(X) ∨ X = a) ∧ w(T )], ← p(X,Y ) ∧ w(X) ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ w(X) ∧ (m(T ) ∨ T = a)] }

Now we can eliminate the disjunctions at level 1 and obtain the following set.

{← m(X) ∧ w(X),
← X = a ∧ w(X),
← p(X,Y ) ∧ m(X) ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ m(X) ∧ w(T )] ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ X = a ∧ w(T )],
← p(X,Y ) ∧ X = a ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ m(X) ∧ w(T )] ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ X = a ∧ w(T )],
← p(X,Y ) ∧ w(X) ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ w(X) ∧ m(T )] ∧ ¬∃(T,Z)[p(X,Z) ∧ p(T,Z) ∧ w(X) ∧ T = a] }

We can now proceed with the optimization of this set of extended denials by using the

OptimizeLH transformation. Clearly, the first, the third and the fifth extended denial are extended-subsumed by the first, the second and, respectively, the third extended denial in

UnfoldLH (S) and are thus eliminated. The second denial reduces to ← w(a). In the fourth denial the equality X = a at level 0 is eliminated, thus substituting X with a in the whole extended denial. We obtain the following. {← w(a), ← p(a,Y ) ∧ ¬∃(T,Z)[p(a,Z) ∧ p(T,Z) ∧ m(a) ∧ w(T )]∧ ¬∃(T,Z)[p(a,Z) ∧ p(T,Z) ∧ a = a ∧ w(T )] } For the last extended denial, first we can eliminate the trivially succeeding equality a = a from the body of the second NEE. Then we can consider that

← ¬∃(T,Z)[p(a,Z) ∧ p(T,Z) ∧ m(a) ∧ w(T )] extended-subsumes

← p(a,Y ) ∧ ¬∃(T,Z)[p(a,Z) ∧ p(T,Z) ∧ w(T )] so we can eliminate by subsumption factoring the subsuming part and leave the subsumed one. The simplification procedure for LH applied to S and U returns the following result.

SimpLH^U(S) = { ← w(a), ← p(a,Y ) ∧ ¬∃(T,Z)[p(a,Z) ∧ p(T,Z) ∧ w(T )] }

This coincides with the result given in [116], rewritten with our notation, with the only difference that they do not assume disjointness of IDB and EDB, so, in the latter extended denial, they have the extra conjunct ¬∃V [married to(a,V )].

4.2 Aggregates and arithmetic

Aggregates and arithmetic operators are among the most used and widespread facilities in database utilization. To be of practical use, any simplification procedure must support such constructs. In this section we present a set of rewrite rules that allow the decomposition of aggregate expressions into simpler ones that can then be simplified by a constraint solver for arithmetic expressions. The practical significance of these rules is demonstrated with an extensive set of examples.

4.2.1 A syntax for aggregates and arithmetic

In order to extend our framework with arithmetic and aggregates (which are not part of datalog and are not first-order expressible), a suitable syntax is needed. For this purpose, we assume throughout this section that terms are typed, so as to recognize numeric arguments. We also assume that the language includes built-in arithmetic constraints (<, ≤, >, ≥), whose arguments are arithmetic expressions. An arithmetic formula is a formula in which all the predicates are arithmetic constraints. An arithmetic expression is a numeric term, an aggregate term, or a legal combination of arithmetic expressions via the four operations (+, −, ·, /), indicated in infix notation. An aggregate term is an expression of the form A[X](∃X1,...,Xn F), where A is an aggregate operator, F is a defining formula, X1,...,Xn (the local variables) are F's distinguished variables, and the optional X, if present, is the variable, called aggregate variable, among the X1,...,Xn on which the aggregate function is calculated. The non-local variables in F are called the global variables; the argument of the aggregate is called the key formula. To simplify notation, inside key formulas we underline the local variables and do not indicate the existential quantifications. We also assume that the local variables of a key formula do not occur outside that key formula. The aggregate operators we consider are: Cnt (count), Sum, Max, Min, Avg (average).

Example 4.2.1 Given the person relation p(name, age), AvgY (p(X,Y )) is an aggre- gate term that indicates the average age of all persons. Let r(X) indicate that X is a reviewer and sub(X,Y ) indicate that submission X is assigned to reviewer Y . Then ← r(X)∧Cnt(sub(Y , X)) > 3 is a denial requiring that no reviewer is assigned more than 3 submissions.
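To fix intuitions about how such a denial is evaluated in a concrete state, the following toy sketch (ours; the relation contents and all names are invented for illustration) checks the reviewer constraint of example 4.2.1 directly over Python collections.

from collections import Counter

# Toy state: reviewers r(X) and assignments sub(submission, reviewer).
reviewers = {"alice", "bob"}
sub = [("s1", "alice"), ("s2", "alice"), ("s3", "alice"), ("s4", "alice"), ("s5", "bob")]

def violations():
    """Evaluate the denial  ← r(X) ∧ Cnt(sub(Y, X)) > 3 :
    reviewer X violates it if more than 3 submissions are assigned to X."""
    per_reviewer = Counter(x for _, x in sub)        # Cnt(sub(Y, X)) for each X
    return [x for x in reviewers if per_reviewer[x] > 3]

print(violations())    # ['alice']  ->  the constraint is violated in this state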

The extension of LS in which aggregates and arithmetic expressions can occur is denoted by LA.

4.2.2 Set vs. bag semantics

The semantics of aggregates depends on the semantics underlying the representation of data. Two different semantics are of interest: set semantics, as traditionally used in logic, and bag semantics, typical of relational database systems. A state D contains, for every extensional predicate p, a relation pD; under set semantics, as we saw in definition 2.2.19 on page 16, pD is the extension of p and is thus a set of tuples; under bag semantics, pD is a bag (or multiset) of tuples, i.e., a set of ⟨tuple, multiplicity⟩ pairs, where the multiplicity is a positive integer. Similarly, any defining formula F defines a new finite relation FD. Under set semantics, the extension consists of all different answers that F produces over D; under bag semantics, the tuples are the same as for set semantics and the multiplicity of each tuple is the number of times it can be derived over D. A complete formal account on bag semantics is given in [52]; we show here how the multiplicity of a query is calculated. Suppose that Q is a conjunctive query, i.e., a query of the form ⇐ p1(X~1) ∧ · · · ∧ pn(X~n), where the pi's are extensional predicates. An assignment mapping of a conjunctive query Q into a database D is an assignment of constants in D to the variables of Q such that

every conjunct in Q is mapped to a tuple in D. For an assignment mapping θ, we denote by θ(X~ ) the constants to which θ maps X~ and by mi the multiplicity of the tuple to which pi(X~i) is mapped. Then the result due to θ of Q over D is the tuple θ(X~ ) with multiplicity m = m1 · . . . · mn, where X~ are the distinguished variables of Q. The result of a query Q over a database D is given by ⊎θ rθ, where θ is any assignment mapping of Q into D and rθ the result due to θ, i.e., the result is the bag union (⊎) of all results due to the possible assignment mappings. The notion of multiplicity trivially extends to defining formulas, which can be seen as a generalization of conjunctive queries. Besides positive occurrences of extensional predicates, defining formulas can contain built-in atoms, negative atoms, intensional predicates, and disjunctions. Consider a range restricted query Q of the form ⇐ P ∧ R, where P is the body of a conjunctive query and R is a conjunction of built-in or negated extensional atoms. An assignment mapping of Q into a database D is an assignment of constants in D to the variables of Q such that every conjunct in Q is mapped to a tuple in D and every built-in atom in R is satisfied and, for every negative atom ¬A in R, A is not a tuple in D. Then the multiplicity of the result is defined as before. If the query contains intensional predicates, their multiplicity is defined as that of the corresponding queries. If the query is of the form ⇐ P1 ∨ · · · ∨ Pn, then the multiplicity of the result is the sum of the multiplicities of the results for ⇐ P1, ..., ⇐ Pn. For simplicity, in the remainder of the chapter we disregard intensional predicates, as this does not affect the discourse to follow. Global variables occurring in an aggregate are expected to be range bound for a denial in LA to be safe. Given an assignment mapping θ for its global variables Y~, the aggregate Cnt(∃X~ F) refers to the sum of the multiplicities of the answers to ⇐ F{Y~/θ(Y~)}. Similarly, given an assignment mapping θ for its global variables Y~, the aggregate SumX(∃X~ F) refers to the sum

Σ{(Xσ) · m | Fσ is an answer to ⇐ F{Y~/θ(Y~)} and m its multiplicity}.

The other aggregates are defined in a similar way. We note that Max and Min are indifferent to the choice of semantics; for the other aggregates, we use the notation Cnt, Sum, Avg for bag semantics and CntD, SumD, AvgD when referring to the set of distinct tuples. Note that, according to the definition of update (2.2.20), in both set and bag semantics, when an update U determines the deletion of a tuple p(c~) from a relation pD in a state D, then all of its occurrences in pD are removed, i.e., DU ⊭ p(c~). Addition of a tuple p(c~), under bag semantics, means either adding 1 to the existing multiplicity or creating an entry with multiplicity 1.
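The assignment-mapping definition of bag-semantics multiplicities can be paraphrased operationally as in the following sketch (an illustration of ours, limited to purely positive conjunctive queries; all names and the tiny example database are hypothetical).

from collections import Counter
from itertools import product

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def bag_answers(query, db, distinguished):
    """Bag-semantics answers to a conjunctive query ⇐ p1(t~1) ∧ ... ∧ pn(t~n).
    `query` is a list of (pred, args); `db` maps each pred to a Counter of
    tuples (tuple -> multiplicity). The multiplicity of an answer is the product
    of the multiplicities of the tuples used, summed over all assignment mappings."""
    result = Counter()
    choices = [db[p].items() for p, _ in query]   # one (tuple, multiplicity) per conjunct
    for combo in product(*choices):
        theta, mult, ok = {}, 1, True
        for (p, args), (tup, m) in zip(query, combo):
            mult *= m
            for a, v in zip(args, tup):
                if is_var(a):
                    if theta.setdefault(a, v) != v:
                        ok = False
                elif a != v:
                    ok = False
        if ok:
            result[tuple(theta[x] for x in distinguished)] += mult
    return result

# p contains the tuple (1, 2) twice; q contains (2,) once.
db = {"p": Counter({(1, 2): 2}), "q": Counter({(2,): 1})}
# Query  ⇐ p(X, Y) ∧ q(Y)  with distinguished variables X, Y:
print(bag_answers([("p", ("X", "Y")), ("q", ("Y",))], db, ("X", "Y")))
# Counter({(1, 2): 2})  ->  the answer (1, 2) has multiplicity 2 · 1 = 2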

4.2.3 Rewrite rules for aggregates and arithmetic

In LA, we can still use AfterLS to generate weakest preconditions. This may introduce disjunctions inside key formulas of aggregates.

Example 4.2.2 Let Γ = {← Cnt(p(X)) < 10}, U = {p(a)}, then:

AfterLS^U(Γ) = {← Cnt(p(X) ∨ X = a) < 10}.

The new Cnt expression returned by AfterLS should indicate an increment by one with respect to the original expression. In order to determine this effect, we need to divide the expression into smaller pieces that can possibly be used during the optimization of weakest preconditions. To do that, we introduce further sound rewrite rules for aggregates. Care must be taken when applying transformations that preserve logical equivalence on aggregates with bag semantics. Consider, for instance, SumX (p(X) ∧ X = 1) and SumX ((p(X) ∧ X = 1) ∨ (p(X) ∧ X = 1)). Their key formulas are logically equivalent, but in the latter, the tuple p(1) has double multiplicity. In other words, the results may differ because the number of ways in which the key formula may succeed matters. With regard to arithmetic constraints and expressions, we assume the presence of a standard constraint solver for arithmetic, such as, e.g., [39, 105], in order to determine whether the rewrite rules below (introduced in definition 4.2.5) are applicable. A constraint solver can be regarded as an algorithm whose input is a set of arithmetic constraints over terms and variables and whose output is a smaller, but logically equivalent, set of arithmetic constraints or a “no” if there was no solution for the given problem; the given input is solved when the output set is a set of variable/term assignments. We use the notation A ⇝C A′ to indicate that a constraint solver C receiving in input an arithmetic formula A outputs, in finite time, a smaller arithmetic formula A′; conventionally, A′ is false when there is no solution. We now extend the notion of subsumption to denials containing arithmetic constraints.

Definition 4.2.3 Let C be a constraint solver for arithmetic and D1, D2 be denials of the form ← C1 ∧ A1 and ← C2 ∧ A2, respectively, where C1, C2 are conjunctions of literals without arithmetic constraints and A1, A2 arithmetic formulas. Then D1 C-subsumes D2 iff there is a substitution σ such that each literal in C1σ occurs in C2 and A2 ∧ ¬A1σ ⇝C false.

Example 4.2.4 Consider the following integrity constraints:

φ = ← Cnt(p(X,Y )) < 10 ∧ q(Y )
ψ = ← Cnt(p(X, b)) < 9 ∧ q(b) ∧ r(Z).

For any constraint solver C for which

E < 9 ∧ E ≥ 10 ⇝C false

we have that φ C-subsumes ψ. Note that the arithmetic variable E is used as a shorthand for Cnt(p(X, b)).
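The arithmetic part of this C-subsumption test can be decided by any solver able to detect that a conjunction of bounds on the same term is unsatisfiable. The following sketch is a deliberately tiny stand-in of ours for the solver C of definition 4.2.3, limited to conjunctions of bounds on a single term E; real solvers (such as the finite-domain solver discussed in section 4.2.5) are far more general, and all names here are hypothetical.

import math

def interval(op, c):
    """The set of values of E satisfying 'E op c', as (low, high, low_open, high_open)."""
    if op == '<':  return (-math.inf, c, False, True)
    if op == '<=': return (-math.inf, c, False, False)
    if op == '>':  return (c, math.inf, True, False)
    if op == '>=': return (c, math.inf, False, False)
    raise ValueError(op)

def unsatisfiable(bounds):
    """True iff the conjunction of bounds on a single term E has no solution."""
    lo, hi, lo_open, hi_open = -math.inf, math.inf, False, False
    for op, c in bounds:
        l, h, lop, hop = interval(op, c)
        if l > lo or (l == lo and lop):
            lo, lo_open = l, lop
        if h < hi or (h == hi and hop):
            hi, hi_open = h, hop
    return lo > hi or (lo == hi and (lo_open or hi_open))

def negate(op, c):
    """¬(E op c) expressed as a bound on E."""
    return {'<': ('>=', c), '<=': ('>', c), '>': ('<=', c), '>=': ('<', c)}[op]

# Example 4.2.4, writing E for Cnt(p(X, b)):  A2 is E < 9 and ¬A1σ is E ≥ 10.
print(unsatisfiable([('<', 9), negate('<', 10)]))    # True  ->  φ C-subsumes ψ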

To make the definitions to follow more readable, we introduce conditional expressions, i.e., arithmetic expressions written C ? E1 : E2, whose value is given by the value of E1 if condition C holds, or by the value of E2 otherwise. Similarly, we introduce two binary arithmetic operators max and min (not to be confused with the aggregates Max and Min) and define them in terms of conditional expressions. Square brackets in the subscript of an aggregate indicate that the rule applies both with and without the subscripted aggregate variable. Note that the aggregate variable and the global variables are to be treated as parameters during reduction, as they are not quantified inside an aggregate.

Definition 4.2.5 Given two constraint theories ∆ and Γ in LA and a constraint solver for arithmetic C, OptimizeC^∆(Γ) is the result of applying the following rewrite rules and those of OptimizeLS (definition 3.2.28 on page 41) on Γ as long as possible. Here, X is a local variable, Y a global one, A, B, C are conjunctions of literals such that ← A, ← B, ← C are all range restricted (C with no local variables), E1, E2 arithmetic expressions, A1, A2, A3 arithmetic formulas, t, s terms, c a constant or parameter, F a key formula, Γ′ a constraint theory, σ a substitution, ρ a renaming, Agg any aggregate. Φ(Arg) and Ψ(Arg) are expressions indicating a set of denials and, respectively, a conjunction of literals, in which Arg occurs. The ⊥ symbol indicates the result of an aggregate applied to an empty bag of tuples (e.g., 0 for Cnt and Sum).

Rules for all aggregates

Agg[X](A ∧ Y = t) ⇒ Y = t ? Agg[X](A{Y/t}) : ⊥¹
Agg[X](A) ⇒ ⊥    if (← A)− = true
Agg[X](A) ⇒ Agg[X](B)    if (← A)− = ← B and A ≠ B
{← Agg(F ) ≠ c} ⊔ Φ(Agg(F )ρ) ⇒ {← Agg(F ) ≠ c} ∪ Φ(c)

Rules for Cnt and CntD

Cnt(A ∨ B) ⇒ Cnt(A) + Cnt(B)

CntD(A ∨ B) ⇒ CntD(A) + CntD(B) − CntD(A ∧ B)

Cnt[D](A ∧ t ≠ s) ⇒ Cnt[D](A) − Cnt[D](A ∧ t = s)
Cnt(true) ⇒ 1

CntD(C) ⇒ C ? 1 : 0

Rules for Sum, Avg and SumD, AvgD

SumX (A ∨ B) ⇒ SumX (A) + SumX (B)

SumDX (A ∨ B) ⇒ SumDX (A) + SumDX (B) − SumDX (A ∧ B)

Sum[D]X (A ∧ t ≠ s) ⇒ Sum[D]X (A) − Sum[D]X (A ∧ t = s)
Sum[D]X (A ∧ X = c) ⇒ c · Cnt[D](A{X/c})
Avg[D]X (F ) ⇒ Sum[D]X (F )/Cnt[D](F )

Rules for Max and Min

MaxX (A ∨ B) ⇒ max(MaxX (A), MaxX (B))
MaxX (A ∧ X = c) ⇒ A{X/c} ? c : ⊥
MinX (A ∨ B) ⇒ min(MinX (A), MinX (B))
MinX (A ∧ X = c) ⇒ A{X/c} ? c : ⊥

Rules for conditional expressions

{← Ψ(C ? E1 : E2)} ⇒ {← C ∧ Ψ(E1), ← ¬C ∧ Ψ(E2)}
max(E1, E2) ⇒ E1 > E2 ? E1 : E2
min(E1, E2) ⇒ E1 < E2 ? E1 : E2

1Provided that t is not a local variable.

Rules for the interaction with the arithmetic constraint solver

{φ} ⊔ Γ′ ⇒ {ψ−} ∪ Γ′    if ({φ} ⊔ Γ′ ∪ ∆) ⊢R ψ and ψ− strictly C-subsumes φ
{← A ∧ A1} ⇒ {← A ∧ A2}    if A1 ⇝C A2
{← A ∧ A1, ← Aσ ∧ A2} ⇒ {← A ∧ A1, ← Aσ ∧ A3}    if A1σ ∨ A2 ⇝C A3²
{← A ∧ X ≷ t} ⇒ {← A}    if {t, X} ∩ vars(A) = ∅ and t is not X

As mentioned, (global) variables occurring in aggregates or arithmetic expressions must be range bound by other literals in a denial, for it to be safe. We observe that these rules preserve safeness, since they only modify the arithmetic part of denials. The next example illustrates a few, simple rule applications.

Example 4.2.6 Consider the aggregates A1 = Cnt(p(X) ∧ X ≠ a) and A2 = SumX (p(X) ∧ X ≠ a), where a is a numeric parameter. We have:

A1 ⇒ Cnt(p(X)) − Cnt(p(X) ∧ X = a) ⇒ Cnt(p(X)) − Cnt(p(a)).
A2 ⇒ SumX (p(X)) − SumX (p(X) ∧ X = a) ⇒ SumX (p(X)) − a · Cnt(p(a)).

Note that the fourth Sum rule in definition 4.2.5 indicates that, when the value of the aggregate variable is known, the sum will equal that value multiplied by the number of times the aggregate formula succeeds.
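The decomposition of A2 can be checked numerically on any concrete bag of p-tuples, as in the following small sketch (ours; the toy bag and all names are invented for illustration).

from collections import Counter

# Bag of p-tuples: value -> multiplicity (e.g., p(2) occurs twice).
p = Counter({2: 2, 5: 1, 7: 3})
a = 5                                                  # the parameter of example 4.2.6

sum_all = sum(x * m for x, m in p.items())             # Sum_X(p(X))
cnt_a   = p[a]                                         # Cnt(p(a))
lhs     = sum(x * m for x, m in p.items() if x != a)   # Sum_X(p(X) ∧ X ≠ a)
rhs     = sum_all - a * cnt_a                          # Sum_X(p(X)) − a · Cnt(p(a))
print(lhs, rhs)    # 25 25  ->  the two sides agree on this state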

The simplification procedure can now be extended with these new rules.

Definition 4.2.7 Consider a schema S = ⟨IDB, Γ⟩ ∈ LA and an update U. Let C be a constraint solver for arithmetic and let UnfoldLS(S) = ⟨∅, Γ′⟩. We define

SimpC^U(S) = OptimizeC^Γ′(AfterLS^U(S)).

Since the rules of OptimizeC are equivalence preserving, the correctness of SimpC can be stated, as for Simp, in theorem 4.2.8, with the extra assumption that the constraint solver for arithmetic always terminates. Termination is then guaranteed, because it can easily be shown that the rules of OptimizeC cannot go into a loop3.

Theorem 4.2.8 Given a constraint solver for arithmetic C that terminates on any input, SimpC terminates on any input and, for any schema S ∈ LA and update U, SimpC^U(S) is a CWP of S with respect to U.

2Note that when σ is a renaming, the first produced denial is redundant and will be eliminated by the previous rule.
3The rules in the last subgroup follow the same termination principle as the rules in OptimizeLS, whereas the rules in the other subgroups remove equalities, non-equalities and disjunctions from aggregates, which are not reintroduced by the rules in the last subgroup.

4.2.4 Examples

We now show a series of examples that demonstrate the behavior of the rules and the simplification procedure in various cases; for readability, we leave out some of the trivial steps and only consider empty IDB's.

Example 4.2.9 (4.2.2 continued) Let Γ = {← Cnt(p(X)) < 10}, U = {p(a)}, then:

SimpC^U(Γ) = OptimizeC^Γ({← Cnt(p(X) ∨ X = a) < 10})
           = OptimizeC^Γ({← Cnt(p(X)) + Cnt(X = a) < 10})
           = OptimizeC^Γ({← Cnt(p(X)) + 1 < 10})
           = ∅.

The update increments the count of p-tuples, which was known (Γ) to be at least 10 before the update, so this increment cannot undermine the validity of Γ itself. The last step, obtained via C-subsumption, allows one to conclude that no check is necessary to guarantee consistency of the updated database state.

Example 4.2.10 Let Γ = {← CntD(p(X)) ≠ 10} (there must be exactly 10 distinct p-tuples) and U = {p(a)}. Under set semantics, the increment of the count depends on the existence of the tuple p(a) in the state:

SimpC^U(Γ) = OptimizeC^Γ({ ← CntD(p(X) ∨ X = a) ≠ 10})
           = OptimizeC^Γ({ ← CntD(p(X)) + 1 − CntD(p(a)) ≠ 10})
           = OptimizeC^Γ({ ← 10 + 1 − (p(a) ? 1 : 0) ≠ 10})
           = OptimizeC^Γ({ ← p(a) ∧ 10 + 1 − 1 ≠ 10, ← ¬p(a) ∧ 10 + 1 − 0 ≠ 10})
           = {← ¬p(a)}.

The arithmetic constraint solver intervenes in the last step suggesting that 10 + 1 − 1 ≠ 10 is a falsity and 10 + 1 − 0 ≠ 10 a tautology. In the second step, CntD(p(X)) is replaced by 10 because of the hypothesis Γ by applying the last rule of the first subgroup of rules in definition 4.2.5. The result indicates that, if a is not among the values already in p, then the constraint will be violated.

Example 4.2.11 When global variables occur, conditional expressions are used to separate different cases. Suppose that p(X,Y ) indicates that patient X is hospitalized with illness Y and that q(Y ) means that illness Y requires quarantine. Let Γ = {← Cnt(p(X, Y )) > 10 ∧ q(Y )} be a hospital policy imposing that for each illness requiring quarantine there are no more than 10 hospitalized patients having that illness. Let U = {p(a, b)} be the addition of a patient-illness tuple. The calculation of SimpC^U(Γ) is as follows.

OptimizeC^Γ(AfterLS^U(Γ))
  = OptimizeC^Γ({ ← Cnt(p(X,Y )) + Cnt(X = a ∧ Y = b) > 10 ∧ q(Y )})
  = OptimizeC^Γ({ ← Cnt(p(X,Y )) + Cnt(Y = b) > 10 ∧ q(Y )})
  = OptimizeC^Γ({ ← Cnt(p(X,Y )) + (Y = b ? 1 : 0) > 10 ∧ q(Y )})
  = OptimizeC^Γ({ ← Y = b ∧ Cnt(p(X,Y )) + 1 > 10 ∧ q(Y ), ← Y ≠ b ∧ Cnt(p(X,Y )) + 0 > 10 ∧ q(Y )})
  = {← Cnt(p(X, b)) > 9 ∧ q(b)}.

In the last step, the second constraint is C-subsumed by Γ and thus eliminated. The result reads as follows: if illness b requires quarantine, then there must not already be more than 9 hospitalized patients having that illness.
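Operationally, the simplified constraint amounts to a cheap, parameterized pre-test of the insertion p(a, b) against the old state only, as in the following sketch (ours; the toy relations and all names are invented, and it presupposes that the old state satisfies Γ).

# Toy state: p(patient, illness) and q(illness).
p = [("john", "flu"), ("mary", "flu"), ("ada", "measles")]
q = {"measles"}                       # illnesses requiring quarantine

def can_add(patient, illness):
    """Pre-test derived from the simplified constraint  ← Cnt(p(X, b)) > 9 ∧ q(b):
    the insertion p(patient, illness) is rejected only if the illness requires
    quarantine and more than 9 patients already have it."""
    if illness not in q:
        return True
    return sum(1 for _, i in p if i == illness) <= 9

print(can_add("bob", "measles"))   # True: only 1 patient with measles so far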

Example 4.2.12 We propose now an example with a complex update. Let e(X,Y,Z) represent employees of a company, where X is the name, Y the years of service and Z the salary. The company’s policy is expressed by

Γ = { ← e(X,Y,Z) ∧ Z = MaxW (e(U,V , W )) ∧ Y < 5, ← e(X,Y,Z) ∧ Z = MaxW (e(U,V , W )) ∧ Y > 8 } i.e., the seniority of the best paid employee must be between 5 and 8 years, and

U = {e(X,Y,Z) ⇐ e(X,Y′,Z) ∧ Y = Y′ + 1} is the update transaction that is executed at the end of the year to increase the seniority of all employees. Note that the application of AfterLS^U to a literal of the form e(X,Y,Z) generates e(X,Y′,Z) ∧ Y = Y′ + 1. The aggregate expression is transformed by AfterLS^U(Γ) into

MaxW (e(U,V′, W ) ∧ V = V′ + 1)

which is simplified by OptimizeC into MaxW (e(U,V′, W )) and thus coincides, modulo renaming, with the original one in Γ. After the optimization steps described above, the result of AfterLS^U(Γ) is transformed into:

{ ← e(X,Y′,Z) ∧ Y = Y′ + 1 ∧ Z = MaxW (e(U, V, W )) ∧ Y < 5,
  ← e(X,Y′,Z) ∧ Y = Y′ + 1 ∧ Z = MaxW (e(U, V, W )) ∧ Y > 8}.

We assume that the arithmetic constraint solver can reduce Y = Y′ + 1 ∧ Y < 5 into Y′ < 4 ∧ Y < 5 for the first denial and Y = Y′ + 1 ∧ Y > 8 into Y′ > 7 ∧ Y > 8 for the second one. Then the first denial is subsumed by the first denial in Γ and we finally obtain:

SimpC^U(Γ) = {← e(X,Y′,Z) ∧ Z = MaxW (e(U, V, W )) ∧ Y′ > 7}, i.e., the person with maximum salary must not exceed 7 years of seniority when the update is applied.

4.2.5 Discussion

The quality of the simplification produced by SimpC is highly dependent on the precision of the constraint solver for arithmetic, which might be unable to reduce particular com- binations of arithmetic constraints. We also note that the interaction between the solver and the simplification procedure, characterized by the last subgroup of rules of definition 4.2.5, captures many interesting cases in which arithmetic-based simplifications are possi- ble, even across different constraints in a theory (second rule in the last subgroup). These

rules are the result of several rounds of fine-tuning of the simplification process based on concrete cases and can fully solve all the examples presented in this section; however, we cannot exclude that more complex cases escape this definition. In this respect, further study is required to establish when minimality of the result can be guaranteed in the presence of aggregates and arithmetic expressions. The constraint solver for finite domains described in [39] and available in current Prolog systems is able to handle the arithmetic part of most of the examples and rules described in this chapter (for integers). An implementation of the solver and the rules that characterize its interaction with the simplification procedure is also possible with the language of Constraint Handling Rules [85], which is an extremely versatile tool for constraint programming. For a survey on constraint solvers and constraint logic programming in general we refer to [105]. The simplification problem for integrity constraints containing aggregates seems, with few rare exceptions, to have been largely ignored. In [66], the most comprehensive simplification method handling aggregates we are aware of, Das extends the simplification method of [122] and applies it to aggregates based on an infinitary axiomatization of aggregates for datalog. However, the hypothesis about consistency of the database prior to the update is not fully exploited; consequently, the simplified test is a set of instances of the original constraints. In our example 4.2.9, for instance, Das' method would return the initial constraint theory, whereas we were able to conclude that no check was needed. A definition of the semantics of SQL with aggregates is given in [166], where it is shown how to translate a subset of SQL into relational calculus and algebra and general strategies for query optimization are investigated for such cases. Further investigation on the semantics of aggregates is given in [109, 73, 169]. We also point out that query containment checking with aggregates is known to be more complex than in pure datalog. Recent works on this topic, providing several complexity results, are [62, 7].

4.3 Recursion

Simplification in the context of recursive databases is an extremely complicated issue for which, to our knowledge, no general and satisfactory solution has been proposed; in many cases, no essential simplification of the original integrity constraints is even possible. However, recursion is a central feature of deductive databases that allows the expression of complex query problems to be formulated within a declarative query language. To this end, we can mention flexible query answering based on taxonomies stored in the database, and various kinds of path-finding problems, such as network routing and travel planning. The introduction of recursion (since 1999 in the SQL standard [103] as stratified linear recursion based on fixpoint semantics) naturally raises a need for a satisfactory treatment of recursion in integrity constraints. Many researchers have examined the expressive power of different classes of recur- sion and have described canonical problems for each class. Such investigations are highly relevant, as recursion makes deductive databases more expressive than first-order query languages, in that, thanks to fixpoint semantics, transitive closure of relations can be for- mulated. This gain in expressiveness and other comparisons between deductive database languages and first-order languages are discussed in [64].

77 The class of problems that can be expressed using linear recursion is very broad. An extensive discussion of the different kinds of linear recursion is found in [168], where recursive rules are rewritten as graphs and classified accordingly. Furthermore, many non-linear problems have an equivalent linear formulation. In [6], it is explained how to produce a linear version of non-linear problems, such as piecewise linear datalog programs, and regular and pseudo-regular chain queries. In addition to that, there are problems that are formulated recursively but that are not inherently recursive, and therefore have a non-recursive counterpart [51]4. However, there exist non- linearizable problems [5], but it is known that bilinear recursion is the hardest case [42], i.e., any database that is neither linear nor bilinear is equivalent to a database that is at most bilinear. As for mutual recursion, it has been shown that it can always be reduced to recursion over a single relation [49]. The task of efficiently evaluating a recursive query has been tackled by a large body of research and has given rise to many different methods using different combinations of strategies, approaches and implementations (top-down vs. bottom-up, compiled vs. interpreted, iterative vs. recursive, ...), as extensively described in [15]. Among these, magic sets and counting are probably the best known evolutions of the basic approach called naive evaluation. In this section we describe a simplification procedure that applies to an important subclass of linear recursion. Discussion and comparison with related work is provided in section 4.3.3.

4.3.1 A simplification pattern for ordered linear recursion

Let LR be the database language which is defined as LS (definition 3.2.3 on page 30), but only requiring the database to be stratified, as opposed to the stronger requirement of having an acyclic (starred) dependency graph. With the application of the After operator, occurrences of recursive predicates in integrity constraints are replaced by new recursive predicates defined in terms of new recursive views. Unfolding for LR is then redefined as in definition 3.2.10, but with the difference that, if a predicate is recursive, then it is not replaced by its defining formula

(we will indicate it as UnfoldLR ).

Definition 4.3.1 Let S = hIDB, Γi be a database schema in LR. We define UnfoldLR (S) as the schema hR, Γ0i, where R is the largest subset of IDB such that pred(R) is a set of recursive predicates and Γ0 is the set of denials obtained by iterating the two following steps as long as possible:

1. replace, in Γ, each occurrence of an atom of the form p(t~) by Fp{X~/t~}, where p ∈ (pred(IDB) \ pred(R)), Fp is p's defining formula and X~ its distinguished variables. If no replacement was made, then stop;

2. transform the resulting formula into a set of denials according to the following pat- terns:

• ← A ∧ (B1 ∨ B2) is replaced by ← A ∧ B1 and ← A ∧ B2;

4But determining whether a problem is inherently recursive is in general undecidable.

• ← A ∧ ¬(B1 ∨ B2) is replaced by ← A ∧ ¬B1 ∧ ¬B2;

• ← A ∧ ¬(B1 ∧ B2) is replaced by ← A ∧ ¬B1 and ← A ∧ ¬B2.

Trivially, for any schema S ∈ LR, we have UnfoldLR (S) ≡ S.

After unfolding, the simplification process could then proceed as in LS, by ignoring the IDB, but then the resulting constraint theory would hardly be any simpler than the original one, at least if the update affects some recursively defined predicate. In the following we describe an important class of linear recursion that embraces some of the most commonly used recursive patterns (such as left- and right-linear recursion [135]), known as ordered linear recursion (OLR) [154]. For these cases we are able to refine the simplification process, by possibly eliminating the introduction of new recursive views.

Definition 4.3.2 A predicate r is an OLR predicate if it is defined as follows

{ r(X~, Y~ ) ← q(X~, Y~ ),
  r(X~, Y~ ) ← p(X~, Z~ ) ∧ r(Z~, Y~ ) },        (4.2)

where p and q are predicates on which r does not depend and X~, Y~, Z~ are disjoint sequences of distinct variables. The first rule is the exit rule, while the other is the recursive rule.

There may in principle be several exit rules and recursive rules for the same OLR predicate r; however, these can always be reduced to one single exit rule and recursive rule by introducing suitable new views. Note thus that p and q need not be extensional predicates. We first transform the definition of r so as to decompose it into two parts: a non-recursive definition (4.3) and a transitive closure definition rp (4.4). If p and q are the same predicate, then no transformation is needed, as the definition of r is already the transitive closure of p. Otherwise we replace r's definition with the union of the two following sets of rules:

{ r(X~, Y~ ) ← q(X~, Y~ ),
  r(X~, Y~ ) ← rp(X~, Z~ ) ∧ q(Z~, Y~ ) }.        (4.3)

{ rp(X~, Y~ ) ← p(X~, Y~ ),
  rp(X~, Y~ ) ← p(X~, Z~ ) ∧ rp(Z~, Y~ ) }.        (4.4)

Note that the argument is perfectly symmetric when r's recursive rule is of the form

r(X~, Y~ ) ← r(X~, Z~ ) ∧ p(Z~, Y~ ).

In this case the second rule in (4.3) becomes

r(X~, Y~ ) ← q(X~, Z~ ) ∧ rp(Z~, Y~ )

and rp is defined exactly as in (4.4). All occurrences of r in a constraint theory can now be unfolded with respect to the rules in (4.3), which introduce q and rp, the latter being the transitive closure of p, defined in (4.4). Intuitively, it is easy to characterize the set of tuples that are added to rp upon addition of a p-tuple, as rp can be thought of as a representation of paths of a directed

graph of p-arcs. Suppose for the moment that update U is the addition of tuple ⟨a~, b~⟩ to p; then all added rp paths are those that pass by the new p-arc and that were not there before the update. If we indicate as δ⁺U rp(X~, Y~ ) the fact that there is a new path from X~ to Y~ after the update U, this circumstance can be expressed by the following rule:

δ⁺U rp(X~, Y~ ) ← (rp(X~, a~) ∨ X~ = a~) ∧ (rp(b~, Y~ ) ∨ Y~ = b~) ∧ ¬rp(X~, Y~ ).
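The transitive-closure reading of rp and the rule just displayed can be illustrated operationally by the following sketch (ours; function names, node names and the use of a bottom-up fixpoint are illustrative assumptions, and only the single-arc case of the rule is covered).

def transitive_closure(p):
    """Bottom-up (semi-naive) evaluation of the rules (4.4):
         rp(X, Y) <- p(X, Y)
         rp(X, Y) <- p(X, Z) ∧ rp(Z, Y)
    `p` is a set of (x, y) arcs; the result is the set of rp-paths."""
    rp, delta = set(p), set(p)
    while delta:
        new = {(x, y) for (x, z) in p for (z2, y) in delta if z == z2} - rp
        rp |= new
        delta = new
    return rp

def new_paths(rp, a, b):
    """δ⁺U rp for the insertion of the single arc (a, b), as in the rule above:
    a new path goes from a node reaching a (or a itself) to a node reachable
    from b (or b itself), and must not already be in rp (effectiveness test)."""
    into_a = {x for (x, y) in rp if y == a} | {a}
    from_b = {y for (x, y) in rp if x == b} | {b}
    return {(x, y) for x in into_a for y in from_b} - rp

arcs = {("n1", "n2"), ("n2", "n3")}           # a chain n1 -> n2 -> n3
rp = transitive_closure(arcs)
print(sorted(new_paths(rp, "n3", "n4")))      # adding the arc n3 -> n4
# [('n1', 'n4'), ('n2', 'n4'), ('n3', 'n4')]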

However, in the general case U is not necessarily a single tuple update, so δ⁺U rp needs, in general, to be characterized in terms of “rp in the updated state”, as specified in definition 4.3.3.

Definition 4.3.3 Let U be an update and rp a recursive predicate that expresses the transitive closure of a non-recursive predicate p in a schema S in LR; let rp^U, p^U be the predicates replacing rp, p in S to obtain S′ = After^U(S), respectively, as in definition 3.2.5 on page 31. Let OLR(rp^U, S′) be the following set of rules:

{ rp^U(X~, Y~ ) ← (rp(X~, Y~ ) ∧ ¬δ⁻U rp(X~, Y~ )) ∨ δ⁺U rp(X~, Y~ ),
  δ⁺U rp(X~, Y~ ) ← (rp^U(X~, W~1) ∨ X~ = W~1) ∧ (rp^U(W~2, Y~ ) ∨ Y~ = W~2) ∧ δ⁺U p(W~1, W~2) ∧ ¬rp(X~, Y~ ),
  δ⁻U rp(X~, Y~ ) ← (rp(X~, W~1) ∨ X~ = W~1) ∧ (rp(W~2, Y~ ) ∨ Y~ = W~2) ∧ δ⁻U p(W~1, W~2) ∧ ¬rp^U(X~, Y~ ),
  δ⁺U p(X~ ) ← p^U(X~ ) ∧ ¬p(X~ ),
  δ⁻U p(X~ ) ← ¬p^U(X~ ) ∧ p(X~ ) }

We define OLR(S′) as the schema obtained from S′ by replacing the clauses defining each transitive closure predicate rp^U by OLR(rp^U, S′).

Proposition 4.3.4 Let S′ be as in definition 4.3.3. Then OLR(S′) ≡ S′.

Proof (Sketch) The constraint theories of OLR(S′) and S′ are identical. To see that the replacement of rp^U's definition does not affect its evaluation in any database, it suffices to consider that there is an isomorphism between rp and the transitive closure of a graph of p-arcs [3] and that:

• a path is added (deleted) if it exists in the new (old) graph, it passes by an added (deleted) arc and it does not exist in the old (new) graph.

The isomorphism is as follows: p(X~, Y~ ) indicates a (directed) arc between node X~ and node Y~ in the original graph, p^U(X~, Y~ ) an arc in the new graph, rp(X~, Y~ ) a path between node X~ and node Y~ in the original graph, rp^U(X~, Y~ ) a path in the new graph, δ⁺U p(X~ ) an added arc, δ⁻U p(X~ ) a deleted arc, δ⁺U rp(X~, Y~ ) an added path, δ⁻U rp(X~, Y~ ) a deleted path. 2

The above isomorphism also allows us to state that if δ⁺U p corresponds to the addition of 0 or 1 arcs, then a more convenient expression can be found for δ⁺U rp that only refers to rp and not to rp^U.

Proposition 4.3.5 Consider the set of rules in definition 4.3.3. If δ⁺U p(W~1, W~2) ≡ W~1 = c~1 ∧ W~2 = c~2 ∧ A, where A is a conjunction of literals and c~1, c~2 are vectors of constants or parameters, then the definition of δ⁺U rp can be written as follows.

δ⁺U rp(X~, Y~ ) ← (rp(X~, c~1) ∨ X~ = c~1) ∧ (rp(c~2, Y~ ) ∨ Y~ = c~2) ∧ W~1 = c~1 ∧ W~2 = c~2 ∧ A ∧ ¬rp(X~, Y~ ).        (4.5)

The last literal in the body of (4.5) is also known as the effectiveness test. A sufficient condition to test the applicability of proposition 4.3.5 is given as follows, using AfterLS and OptimizeLS. Let us first consider that, using the rules of definition 4.3.3, we have δ⁺U p(X~, Y~ ) ≡ p^U(X~, Y~ ) ∧ ¬p(X~, Y~ ). Then, writing all conjuncts as denials and using the definition of AfterLS, we have δ⁺U p(X~, Y~ ) ≡ AfterLS^U({← ¬p(X~, Y~ )}) ∪ {← p(X~, Y~ )}. Note that the variables in δ⁺U p(X~, Y~ ) are not quantified and therefore are to be treated as parameters. The expression OptimizeLS^∅(AfterLS^U({← ¬p(X~, Y~ )}) ∪ {← p(X~, Y~ )}) is then also equivalent to δ⁺U p(X~, Y~ ). Proposition 4.3.5 is applicable if the resulting constraint theory contains two constraints of the form ← c~1 ≠ X~ and ← c~2 ≠ Y~, i.e., if it matches the pattern given for δ⁺U p. This condition is, however, only sufficient, as the described expression may not reduce to the desired form in all cases.

Definition 4.3.6 Let U be an update and S a schema in LR. Let S∗ be the same as S but in which, for every OLR predicate r, its definition (4.2) is replaced by (4.3) and (4.4). Then AfterLR^U(S) is defined as UnfoldLR(OLR(After^U(S∗))).

Let R be a set of rules defining a transitive closure predicate rp (as the rules in (4.4)). We denote with den∗(R) the following set of denials:

{ ← ¬rp(X~, Y~ ) ∧ p(X~, Y~ ),
  ← ¬rp(X~, Y~ ) ∧ p(X~, Z~ ) ∧ rp(Z~, Y~ ),
  ← ¬rp(X~, Y~ ) ∧ rp(X~, Z~ ) ∧ rp(Z~, Y~ ) }

The first two denials are merely a rewriting of the “if” part of the rules; the third denial captures the knowledge that the linear and the bilinear definition of the transitive closure are equivalent. Such denials can be used in the optimization phase and typically allow simplifying the effectiveness test. Let us denote with den(A ← B) the denial ← ¬A ∧ B. Let R = R′ ∪ R1 ∪ . . . ∪ Rn be a set of rules, where each Ri is a set of rules defining a transitive closure predicate rpi and none of the predicates in pred(R′) is a transitive closure predicate; then, with den(R) we denote the set {den(C) | C ∈ R′} ∪ den∗(R1) ∪ . . . ∪ den∗(Rn). We can now define the SimpLR operator by optimizing the result of AfterLR using the denial version of the recursive rules.

Definition 4.3.7 Let $U$ be an update and $S = \langle IDB, \Gamma \rangle$ a schema in $\mathcal{L}_R$. Let $\langle R_\Delta, \Delta \rangle = Unfold_{LR}(S)$ and $\langle R, \Gamma' \rangle = After^U_{LR}(S)$, and let $\Sigma = Optimize^{\Delta \cup den(R)}_{LS}(\Gamma')$. We define $Simp^U_{LR}(S)$ as $\langle R^*, \Sigma \rangle$, where $R^*$ is the largest subset of $R$ including only definitions of predicates on which $\Sigma$ depends.

The characterization of $\delta^-_U r_p$ given in definition 4.3.3 requires the evaluation of $\neg r_p^U$. However, we shall see that in many interesting cases $\delta^-_U r_p$ is simplified away.

We also note that the new views introduced by $After_{LR}$ can be completely disregarded if $r_p^U$ does not occur in the simplified constraints. If both the new and the old state are available, as in some trigger implementations, $r_p^U$ can be evaluated as "$r_p$ in the new state". However, these are precisely the cases in which the simplification was, to some extent, unsuccessful, as accessing or simulating the new state clearly requires extra work.

4.3.2 Examples

We now present an example that will also be used in section 4.3.3 to compare the present work with previous methods.

Example 4.3.8 Consider the schema and update of example 3.2.8 on page 33 and the definitions in definition 4.3.3. In order to determine how to check whether the database is consistent after $U$, we calculate the simplification of $S$ with respect to $U$. $\Gamma$ unfolds to $\{\leftarrow p(X, X)\}$, which coincides with $\Gamma$ itself. The constraint theory of $After^U_{LR}(S)$ is calculated as

$\{\leftarrow (p(X, X) \wedge \neg\delta^-_U p(X, X)) \vee \delta^+_U p(X, X)\} \equiv \{\ \leftarrow p(X, X) \wedge \neg\delta^-_U p(X, X),\ \leftarrow \delta^+_U p(X, X)\ \}$

and unfolded accordingly. When $Optimize_{LS}$ is applied to $After^U_{LR}(S)$, every unfolding of the first constraint will be removed, as it is subsumed by $\leftarrow p(X, X)$. Furthermore, $\delta^+_U e(X, Y)$ bounds both $X$ and $Y$:

$\delta^+_U e(X, Y) \equiv \neg e(a, b) \wedge X = a \wedge Y = b.$

Therefore we can replace $\delta^+_U p$ as in (4.5):

$\delta^+_U p(X, Y) \leftarrow (p(X, a) \vee X = a) \wedge (p(b, Y) \vee Y = b) \wedge \neg e(a, b) \wedge \neg p(X, Y),$

which unfolds in the remaining constraint $\leftarrow \delta^+_U p(X, X)$ as follows:

$\{\ \leftarrow p(X, a) \wedge p(b, X) \wedge \neg e(a, b) \wedge \neg p(X, X),\ \leftarrow p(b, a) \wedge \neg e(a, b) \wedge \neg p(b, b),\ \leftarrow p(b, a) \wedge \neg e(a, b) \wedge \neg p(b, b),\ \leftarrow a = b \wedge \neg e(a, b) \wedge \neg p(a, a)\ \}.$

The second and third constraints are identical, and therefore either can be removed. The $\neg p(\cdot, \cdot)$ literals are removed, in $Optimize_{LS}$, via resolution with the denial $\leftarrow p(X, X)$. Since $p$ is a transitive closure predicate, the hypotheses used in the optimization step include $\leftarrow \neg p(X, Y) \wedge e(X, Y)$, $\leftarrow \neg p(X, Y) \wedge e(X, Z) \wedge p(Z, Y)$ and $\leftarrow \neg p(X, Y) \wedge p(X, Z) \wedge p(Z, Y)$. Then, all $\neg e(a, b)$ literals can be removed in all denials but the first one. In the last denial, this is done by reduction and resolution via the intermediate $\vdash_R$-derivation of $\leftarrow e(X, X)$ (obtained via $\leftarrow \neg p(X, Y) \wedge e(X, Y)$ and $\leftarrow p(X, X)$). In the second (and third) constraint we use $\leftarrow e(X, Z) \wedge p(Z, X)$ in a similar way (obtained via $\leftarrow \neg p(X, Y) \wedge e(X, Z) \wedge p(Z, Y)$ and $\leftarrow p(X, X)$). Finally, with a resolution step between $\leftarrow \neg p(X, Y) \wedge p(X, Z) \wedge p(Z, Y)$ and the obtained $\leftarrow p(b, a)$, we get $\leftarrow p(b, Z) \wedge p(Z, a)$, which subsumes the first constraint. In conclusion,

$Simp^U_{LR}(S) = \{\ \leftarrow p(b, a),\ \leftarrow a = b\ \}.$

Note that $Simp^U_{LR}(S)$ is a much simpler test than $\Gamma$, as it basically requires checking whether there exists a path between two given nodes, whereas $\Gamma$ implies testing the existence of a cyclic path for all the nodes in the graph.
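To make the gap concrete, the following illustrative sketch (not part of the dissertation; the encoding of the stored extension of p as a set of pairs and the parameter names a, b are assumptions) contrasts the two checks: the original Γ scans all stored paths for a cycle, while the simplified test needs only two point lookups.

    def full_check(paths):
        # Original constraint <- p(X, X): no node may lie on a cycle.
        return all(x != y for (x, y) in paths)

    def simplified_check(paths, a, b):
        # Simp^U_LR(S) = { <- p(b, a), <- a = b }: adding the arc (a, b)
        # preserves consistency iff a != b and no path from b back to a exists.
        return a != b and (b, a) not in paths

In a database setting the two lookups of simplified_check translate into two selections on the stored relation for p, whereas full_check requires a scan of it.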

Another interesting and more complex example shows how important problems can be reduced to OLR and solved accordingly.

Example 4.3.9 In [134], the following recursive predicate b is described:

$\{\ b(X, Y) \leftarrow k(X, Z) \wedge b(Z, Y) \wedge c(Y),\quad b(X, Y) \leftarrow d(X, Y)\ \},$

where $b$ stands for "buys", $k$ for "knows", $c$ for "cheap" and $d$ for "definitely buys". These definitions can be rewritten [104] as follows⁵:

$\{\ b'(X, Y) \leftarrow k(X, Z) \wedge b'(Z, Y),\quad b'(X, Y) \leftarrow k(X, Z) \wedge d(Z, Y) \wedge c(Y),\quad b(X, Y) \leftarrow b'(X, Y),\quad b(X, Y) \leftarrow d(X, Y)\ \}.$

Replacing the body of $b'$'s exit rule with a new view $e$ makes $b'$ an OLR predicate:

$IDB = \{\ b'(X, Y) \leftarrow k(X, Z) \wedge b'(Z, Y),\quad b'(X, Y) \leftarrow e(X, Y),\quad e(X, Y) \leftarrow k(X, Z) \wedge d(Z, Y) \wedge c(Y),\quad b(X, Y) \leftarrow b'(X, Y),\quad b(X, Y) \leftarrow d(X, Y)\ \}.$

Consider now a schema $S = \langle IDB, \Gamma \rangle$ and a scenario in which a given person $p$ does not want to buy cheap products, expressed by $\Gamma = \{\leftarrow b(p, X) \wedge c(X)\}$. Suppose that a person meets another person who is definitely going to buy something. This event can be represented by the update $U = \{k(a, b), d(b, c)\}$. The calculation of $Simp^U_{LR}(S)$ generates the following set of simplified constraints⁶:

$\{\ \leftarrow c(c) \wedge [p = a \vee p = b \vee k'(p, a) \vee k'(p, b)],\quad \leftarrow c(X) \wedge [p = a \vee k'(p, a)] \wedge \{d(b, X) \vee [k'(b, Z) \wedge d(Z, X)]\}\ \},$

where $k'$ is the transitive closure of $k$. The result indicates that $U$ introduces an inconsistency whenever:

• c is cheap, and p is a or b or (in)directly knows a or b, or

• p is a or (in)directly knows a, and b definitely buys something cheap or (in)directly knows someone who definitely buys something cheap.

⁵ The example transformation shown in the referenced article is incorrect. The correct version, which is shown here, was obtained after personal communication with the first author of the paper.
⁶ For readability, the resulting formula is presented with disjunctions and rearranged via other trivial, equivalence-preserving steps.

4.3.3 Related work

With few exceptions, little attention has been devoted to the problem of checking the integrity of a database containing recursive views, although recursion is now part of the current SQL standard. Most methods have been explicitly designed for relational databases with no views or disallow recursion in integrity constraints; we refer to the survey [130] for further references falling under these categories. We conclude this section with a description of some of the methods that apply to recursive databases; we discuss their behavior when applied to the canonical case of linear recursion shown in example 4.3.8. We use constants $a$, $b$ instead of parameters $a$, $b$ for compatibility with these methods.

The simplification technique for deductive databases described in [122] requires the calculation of two sets, $P$ and $N$, that represent the positive and, respectively, negative potential updates generated by a given update on the database. A set $\Theta$ is then computed, which contains all the most general unifiers of the atoms in $P$ and $N$ with the atoms of corresponding sign in the integrity constraint. For example 4.3.8, this yields $P = \{e(a, b), p(X, Y)\}$, $N = \emptyset$ and $\Theta = \{Y/X\}$. The updated database is consistent iff every condition $\Gamma\theta$ holds in it, for all $\theta \in \Theta$, $\Gamma$ being the original constraint theory. In this case, the obtained condition is identical to $\Gamma$ and therefore there is no simplification, unlike the result shown for our method.

In [114], the authors determine low-cost pre-tests which are sufficient conditions that guarantee the integrity of the database. If the pre-tests fail, then integrity needs to be checked with a method, such as ours, that provides a necessary and sufficient condition. A set of literal/condition pairs, called relevant set, is calculated. If the update in question unifies with any of the literals in the relevant set and the attached condition succeeds, then the pre-test fails; otherwise we are sure that the update cannot falsify the integrity constraints. For example 4.3.8 the relevant set is as follows (the "N" subscript indicates that the predicate refers to the updated state):

$\{p(X, X)/true,\ e(X, X)/true,\ e(X, Z)/true,\ p(Z, X)/e_N(X, Z)\}$

The update $e(a, b)$ unifies with $e(X, Z)$, whose associated condition trivially succeeds; therefore the pre-test fails and an exact test needs to be executed.

In the previous chapter, we discussed a method based on partial evaluation that was described in [116]. In principle, such a method could work for recursive databases, if a perfect partial evaluator were available. However, as the authors admit, a loop check needs to be included in the program to ensure termination. Such a loop check does not partially evaluate satisfactorily, resulting in an explosion of (possibly unreachable) alternatives.

In [46], the authors proposed for the first time semi-automatic generation of triggers as a means for constraint maintenance. Their language allows recursively defined constraints as well as a number of other advanced features. However, the checking predicates are not semantically optimized: the generated active rules essentially embed the original constraints. The only semantic enhancement considered is when specific conditions are met that allow replacing a table by the set of new tuples in the update.

With the method described in [48], which is based on the notion of partial subsumption, database rules are annotated with residues to capture the relevant parts that are concerned by the integrity constraints. When doing semantic query optimization, such parts can often allow faster query evaluation times. However, when it comes to integrity checking, the method typically leaves things unchanged in the presence of recursive rules. In example 4.3.8 we need to calculate the residue of the constraint in $\Gamma$ associated with the extensional relation $e$. In this case the partial subsumption algorithm stops immediately, as no resolution step is possible, thus resulting in no simplification at all. As the evaluation of a recursive query involves the evaluation of non-recursive sub-queries inside a loop, the authors suggest the application of their method to these sub-queries.

In [154], the author proposes an approach based on incremental expressions for OLR. We have improved on his method in at least the following respects. Firstly, our rule-based update language is much more general, allowing compound updates, changes and any kind of bulk operation expressible with rules. Secondly, the simplified constraints produced by $Simp_{LR}$ may only need to consult the present database state, whereas the method of [154] always requires the availability of both the old and the new state, even in the non-recursive case. For the treatment of recursion it imposes a number of restrictions on the language (e.g., no negation) that we do not need. Finally, if non-OLR recursive predicates occur in a constraint, the method fails to apply completely, whereas we can anyhow simplify the OLR and the non-recursive parts of the constraint; this may also produce further specialization of the non-OLR recursive predicates.

Recently, renewed attention arose in the field of update propagation, building on previous investigations in the area of view maintenance [99]. Continuing the work of [98], the author of [20] regards integrity checking as an instance of update propagation. In the proposed method, integrity constraints are expressed as propositional predicates that must always be derivable. The database schema is then augmented with new rules that express the incremental evaluation of the new state with respect to a given update. The incrementality of the rules is maximized by rewriting them according to magic sets techniques [14]. After the rewriting, however, an originally stratified database may become non-stratified. A reference model (the so-called well-founded model [165]) for these databases exists, but its computation may be very complex. To this end, a soft consequence operator [19] is then introduced to compute the model of this augmented database. Instead of a symbolic simplification of the original constraints, as, e.g., in our method, this method rather provides an efficient way of evaluating the new state of the database. However, it can be used for an efficient evaluation of constraints that have been simplified by some other method; in this respect, it can be seen as orthogonal to our method, at least when $Simp_{LR}$ does not eliminate references to the new state.

Chapter 5

Other Contexts and Applications

In this chapter we discuss the applicability of simplification techniques in three contexts in which integrity control is of interest. In section 5.1, we consider concurrency in the execution of update transactions and show how it may affect simplification. Section 5.2 presents an extension of the simplification procedure shown in the previous chapters that can be used for integrity checking in data integration systems. In section 5.3, the simplification procedure is reformulated in the context of databases consisting of collections of XML documents.

5.1 Simplification and concurrency

When an update transaction is executed on a database, it is important to ensure both that database consistency is preserved and that the transaction produces the desired result, i.e., its execution is not affected by the execution of other, possibly interleaved transactions. A common view in concurrent database systems is that a transaction executes correctly if it belongs to a schedule that is conflict serializable, i.e., equivalent to a schedule in which all the transactions are executed in series (not interleaved). Several strategies and protocols, such as two-phase locking (2PL) and timestamp ordering, have been established that can dynamically enforce conflict serializability. Locking is the most common practice for concurrency control; however, maintaining locks is expensive and may limit the throughput of the database, as locks actually reduce the concurrency of the accesses to the resources.

The literature is rich in methods aimed at ensuring the correctness of concurrently executed transactions, and protocols other than 2PL have been proposed, e.g., [149]. All these methods depend on the implicit assumption that each single transaction preserves consistency when executed alone [86]; this responsibility is, thus, entrusted to the designer of the transactions.

We present, instead, an integrated approach to automatically extend update transactions with locks and simplified consistency tests on the locked elements such that all schedules produced in this way are conflict serializable and preserve consistency in an optimized way. Consequently, the designer only needs to concentrate on the declarative specification of integrity constraints. Furthermore, with our method, we also determine the minimal amount of database resources to be locked in order to guarantee correctness of all legal schedules. Indeed, the execution schedule must ensure that the different transactions can overlap in time without destroying the consistency requirements tested by other, concurrent transactions.

5.1.1 Transactions and serializability

We now briefly introduce basic notions related to concurrency; we refer to standard database books, such as [86], for further details. The notion of transaction includes, in general, write as well as read operations on database elements. We identify a database element (or resource) with a ground atom of the Herbrand base, and a write operation adds or removes such an atom from the database state. A transaction is always concluded with either a commit or an abort; in the former case the executed write operations are finalized into the database state, in the latter they are cancelled. We omit, for simplicity, the indication of abort and commit operations in transactions and consider the execution of a transaction concluded and committed after its last operation.

Definition 5.1.1 (Transaction) A transaction $T$ of length $n$ is a finite sequence of operations $\langle T^1, \dots, T^n \rangle$ such that each step $T^i$ is either a $read(e_i)$ or a $write(e_i)$ operation on a database element $e_i$. A schedule is a sequence of the operations of one or more transactions.

Definition 5.1.2 (Schedule) A schedule $\sigma$ over a set of transactions $\mathcal{T}$ is an ordering of the operations of all the transactions in $\mathcal{T}$ which preserves the ordering of the operations of each transaction. A schedule is serial if, for any two transactions $T'$ and $T''$ in $\mathcal{T}$, either all operations in $T'$ occur before all operations of $T''$ in $\sigma$ or conversely. A schedule which is not serial, for $|\mathcal{T}| > 1$, is an interleaved schedule.

A schedule executes correctly with respect to concurrency of transactions if it corresponds to the execution of some serial schedule, according to definition 5.1.3 below.

Definition 5.1.3 (Conflict serializability) Two operations in a schedule are conflicting iff they belong to different transactions, refer to the same database element and at least one of them is a write operation. Two schedules are conflict equivalent iff they contain the same set of committed transactions and operations and every pair of conflicting operations is ordered in the same way in both schedules. A schedule is conflict serializable iff it is conflict equivalent to a serial schedule.

Checking conflict serializability can be done in linear time by testing the acyclicity of the directed graph in which the nodes are the transactions in the schedule and the edges correspond to the order of conflicting operations in two different transactions.
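The acyclicity test just described can be sketched as follows; the encoding of a schedule as a list of (transaction, operation, element) triples is an assumption made for illustration, and the sketch is not optimized to run in linear time.

    def conflict_serializable(schedule):
        # schedule: list of (txn, op, element), op in {"r", "w"}.
        # Build the precedence graph: an edge Ti -> Tj for every pair of
        # conflicting operations where Ti's operation comes first.
        edges = set()
        for i, (ti, opi, ei) in enumerate(schedule):
            for (tj, opj, ej) in schedule[i + 1:]:
                if ti != tj and ei == ej and "w" in (opi, opj):
                    edges.add((ti, tj))
        # The schedule is conflict serializable iff the graph is acyclic:
        # repeatedly remove nodes with no incoming edge.
        nodes = {t for (t, _, _) in schedule}
        while nodes:
            removable = {n for n in nodes if all(v != n for (_, v) in edges)}
            if not removable:
                return False          # every remaining node lies on a cycle
            nodes -= removable
            edges = {(u, v) for (u, v) in edges if u in nodes and v in nodes}
        return True

    # r1(x) w2(x) w1(x): edges T1 -> T2 and T2 -> T1, hence not serializable.
    print(conflict_serializable([("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")]))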

5.1.2 Extended transactions

To explain the application of the framework in concurrent databases, we restrict ourselves to viewless schemata and limit updates to sets of tuple additions and deletions, i.e., to updates of the form (3.2) defined on page 45. In other words, the updates consist of write operations on database elements, where the elements are the tuples of the database relations. These operations are known and do not depend on previous read operations. Furthermore, an update does not contain conflicting operations. Therefore we can, for the moment, restrict our attention to write transactions, i.e., sequences of non-conflicting write operations. In order to be able to map back and forth between write transactions and updates, we also indicate for each write whether it is an addition or a deletion (using a $\neg$ sign on the database element).

Definition 5.1.4 (Write transaction) For a given update $U$, any sequence $T$, of minimal length, of write operations on all the literals in $U$ is a write transaction on $U$. Given a write transaction $T$, we indicate the corresponding update as $\overline{T}$.

Example 5.1.5 The possible write transactions on an update $U = \{p(a), \neg q(a)\}$ are $T_1 = \langle write(p(a)), write(\neg q(a)) \rangle$ and $T_2 = \langle write(\neg q(a)), write(p(a)) \rangle$. Conversely, $\overline{T}_1 = \overline{T}_2 = U$. $\Box$

In the following we shall indicate the $i$-th element of a sequence $S$ with the notation $S^i$. For a given write transaction $T$ of length $n$ and a database state $D$, the notation $D^T$ refers to the database state $(\dots((D^{\langle T^1 \rangle})^{\langle T^2 \rangle})\dots)^{\langle T^n \rangle}$. Similarly, for a schedule $\sigma$ over write transactions, the notation $D^\sigma$ refers to the state $(\dots((D^{\langle \sigma^1 \rangle})^{\langle \sigma^2 \rangle})\dots)^{\langle \sigma^n \rangle}$. Clearly, $D^T = D^{\overline{T}}$ for any write transaction $T$. Then the following proposition trivially follows from the definition of CWP (definition 3.1.2 on page 26). Note that, for a schedule $\sigma$, the notation $D^{\overline{\sigma}}$ is not allowed, as $\sigma$ might contain conflicting operations, i.e., $\sigma$ cannot always be mapped to an update $\overline{\sigma}$.

Proposition 5.1.6 Let $T$ be a write transaction, $\Gamma$ a constraint theory, $D$ a database state consistent with $\Gamma$ and let $\Sigma$ be a CWP of $\Gamma$ with respect to $T$. Then $D^T \models \Gamma$ iff $D \models \Sigma$.

Proposition 5.1.6 indicates that a write transaction executes correctly if and only if the database state satisfies a corresponding CWP. This leads to the following notion of simplified write transaction.

Definition 5.1.7 (Simplified write transaction) Let $T$ be a write transaction, $\Gamma$ a constraint theory and $D$ a database state. The simplified write transaction of $T$ with respect to $\Gamma$ and $D$ is

• $T$ if $D \models \Sigma$, where $\Sigma$ is a CWP of $\Gamma$ with respect to $T$;

• $\langle \rangle$ if $D \not\models \Sigma$, where $\langle \rangle$ is the empty sequence.

Obviously, the execution of a simplified write transaction never violates the integrity constraints, as stated in the corollary below, which directly follows from proposition 5.1.6 and definition 5.1.7.

Corollary 5.1.8 Let $T$ be a write transaction and $\Gamma$ a constraint theory. Then, for every database state $D$ consistent with $\Gamma$, $D^{T_S} \models \Gamma$, where $T_S$ is the simplified write transaction of $T$ with respect to $\Gamma$ and $D$.
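A minimal sketch of the behavior captured by definition 5.1.7 and corollary 5.1.8 follows; the encoding of states as sets of ground atoms, of write steps as signed atoms, and the function names are assumptions made only for illustration.

    def simplified_write_transaction(T, cwp_holds, D):
        # Definition 5.1.7: return T itself if the CWP holds in D, and the
        # empty transaction otherwise; executing the result can never lead
        # from a consistent state to an inconsistent one (corollary 5.1.8).
        return T if cwp_holds(D) else []

    def execute(D, T):
        # A write step is ("write", atom, True) for an addition and
        # ("write", atom, False) for a deletion.
        state = set(D)
        for _, atom, positive in T:
            if positive:
                state.add(atom)
            else:
                state.discard(atom)
        return state

    # Sigma_1 = <- m(homer, y) and y != marge, as in example 5.1.10 below.
    sigma1 = lambda db: not any(p == "m" and h == "homer" and w != "marge"
                                for (p, h, w) in db)
    T1 = [("write", ("m", "homer", "marge"), True)]
    D = set()
    print(execute(D, simplified_write_transaction(T1, sigma1, D)))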

In order to determine a simplified write transaction, the database state must be accessed. We can therefore model the behavior of a scheduler that dynamically produces simplified write transactions by starting them with read operations corresponding to the database actions needed to evaluate the CWP. Checking a constraint theory $\Gamma$ corresponds to reading all database elements that contribute to the evaluation of $\Gamma$, i.e., its resource set (see definition 3.4.1 and following text on page 50). The effort of evaluating a constraint theory can then be expressed by concatenating ($\circ$) the sequence of all needed read operations with the (simplified) write transaction. To simplify the notation, we indicate any minimal sequence of read operations on every element of a resource set $R$ as $Read(R)$; in section 5.1.3 we shall use a similar notation for lock ($Lock(R)$) and unlock ($Unlock(R)$) operations.

Definition 5.1.9 (Simplified read-write transaction) For a constraint theory Γ, a write transaction T and a database state D, any transaction of the form

$T' = Read(R(\Sigma)) \circ T_S$

is a simplified read-write transaction of $T$ with respect to $\Gamma$ and $D$, where $T_S$ is the simplified write transaction of $T$ with respect to $\Gamma$ and $D$, and $\Sigma$ is a CWP of $\Gamma$ with respect to $T$. The execution of $T'$ is said to be legal if $D$ is the state reached after $Read(R(\Sigma))$. The execution of a schedule over simplified read-write transactions is legal if the execution of all its transactions is.

The notion of legal execution of a transaction or schedule relies on the presence of a scheduler that takes care of checking whether the CWP $\Sigma$ holds in the state at which the write transaction $T$ starts. For a schedule $\sigma$ containing read operations, we write $D^\sigma$ as a shorthand for $D^{\sigma_w}$, where $\sigma_w$ is the sequence that contains all the write operations as in $\sigma$ and in the same order, but no read operation. Note that a legally executing schedule over simplified read-write transactions is not guaranteed to execute correctly.

Example 5.1.10 [3.2.32 continued] Consider again $\Gamma = \{\leftarrow m(X, Y) \wedge m(X, Z) \wedge Y \neq Z\}$ disallowing bigamist husbands and the update $U = \{m(a, b)\}$. As shown, $\Sigma = \{\leftarrow m(a, y) \wedge y \neq b\}$ is a CWP of $\{\phi\}$ with respect to $U$. Let us consider the transactions $T_1 = \langle write(m(homer, marge)) \rangle$ and $T_2 = \langle write(m(homer, maude)) \rangle$. Any schedule $\sigma$ over $T_1, T_2$ yields an inconsistent database state, i.e., $D^\sigma \not\models \Gamma$ for any state $D$, because homer would end up having (at least) two different wives. We may try to remedy this situation by considering schedules over the simplified read-write transactions corresponding to $T_1$ and $T_2$ with respect to $\Gamma$ and the state in which their write transaction is executed. Suitable CWPs of $\Gamma$ with respect to $\overline{T}_1$, $\overline{T}_2$ are obtained by instantiating the parameters in $\Sigma$ with the actual constants:

$\Sigma_1 = \{\leftarrow m(homer, y) \wedge y \neq marge\} \qquad \Sigma_2 = \{\leftarrow m(homer, y) \wedge y \neq maude\}.$

In this way a schedule such as

$\sigma_1 = Read(R(\Sigma_1)) \circ \langle write(m(homer, marge)) \rangle \circ Read(R(\Sigma_2)) \circ \langle write(m(homer, maude)) \rangle$

would not be a legally executing schedule over simplified read-write transactions, because in the database state after the last $Read$, $\Sigma_2$ necessarily does not hold. However, it is still possible to create legally executing schedules leading to an inconsistent final state. Consider, e.g., the following schedule:

$\sigma_2 = Read(R(\Sigma_1)) \circ Read(R(\Sigma_2)) \circ \langle write(m(homer, marge)), write(m(homer, maude)) \rangle.$

At the end of the Read(R(Σ2)) sequence, T1 has not yet performed the write operation on m, so Σ2 may hold, but the final state is inconsistent.

5.1.3 Locks

The problem pointed out in example 5.1.10 is due to the fact that some of the elements in the resource set of a CWP of a transaction $T$ were modified by another transaction before $T$ finished its execution. In order to prevent this situation we need to introduce locks. A lock is an operation of the form $lock(e)$ where $e$ is a database element; the lock on $e$ is released with the dual operation $unlock(e)$. A transaction containing lock and unlock operations is a locked transaction. A scheduler for locked transactions verifies that the transactions behave consistently with the locking policy, i.e., database elements are only accessed by transactions that have acquired the lock on them, no two transactions have a lock on the same element at the same time, and all acquired locks are released. The 2PL protocol requires that in every transaction all lock operations precede all unlock operations; every schedule over 2PL transactions is conflict serializable. We can now extend the notion of simplified read-write transaction with locks on the set of resources that are read at the beginning of the transaction and on the elements to be written. For an update $U$ of the form (3.2), we write $R(U)$ to indicate the set of atoms occurring in it.

Definition 5.1.11 (Simplified locked transaction) For a constraint theory Γ, a write transaction T and a database state D, any transaction of the form

$T' = Lock(R(\Sigma)) \circ Read(R(\Sigma)) \circ Lock(R(\overline{T}_S) \setminus R(\Sigma)) \circ T_S \circ Unlock(R(\overline{T}_S) \cup R(\Sigma))$

is a simplified locked transaction of $T$ with respect to $\Gamma$ and $D$, where $T_S$ is the simplified write transaction of $T$ with respect to $\Gamma$ and $D$, and $\Sigma$ is a CWP of $\Gamma$ with respect to $T$. The execution of $T'$ is legal iff $D$ is the state reached before the execution of $T_S$. The execution of a schedule over simplified locked transactions is legal if the execution of all its transactions is.
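The shape of a simplified locked transaction can be sketched as the following sequence constructor (purely illustrative: resource sets are plain Python sets, operations are (kind, element) pairs, and T_S is assumed to have been determined already as in definition 5.1.7, whereas an actual scheduler fixes it only after the reads). Since every lock precedes every unlock, the resulting transaction is trivially 2PL, which is what theorem 5.1.13 below relies on.

    def simplified_locked_transaction(T_S, R_sigma, R_T):
        # Lock(R(Sigma)) o Read(R(Sigma)) o Lock(R(T_S) \ R(Sigma)) o T_S
        #   o Unlock(R(T_S) u R(Sigma)), as in definition 5.1.11.
        return ([("lock", e) for e in R_sigma]
                + [("read", e) for e in R_sigma]
                + [("lock", e) for e in R_T - R_sigma]
                + list(T_S)
                + [("unlock", e) for e in R_T | R_sigma])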

Example 5.1.12 Consider $\Gamma$, $T_1$, $T_2$, $\Sigma_1$, $\Sigma_2$ and $U$ from example 5.1.10. Using the policy of simplified locked transactions, there is no schedule that produces an inconsistent state. Consider, e.g., that the initial database state is empty. Suppose that the simplified locked transaction of $T_1$ executes first. Then the resource set of $\Sigma_1$ will be locked by it and the simplified locked transaction of $T_2$ will have to wait until $R(\Sigma_1)$ is unlocked, i.e., after completion of $T_1$ ($T_1$ executes because $\Sigma_1$ is satisfied). On the contrary, the simplified transaction of $T_2$ will not execute $T_2$, since $\Sigma_2$ does not hold once it has the lock. The final state is $\{m(homer, marge)\}$ and is therefore consistent.

Any legally executing schedule over simplified locked transactions is guaranteed to be correct. For such a schedule $\sigma$, we write $D^\sigma$ as a shorthand for $D^{\sigma_w}$, where $\sigma_w$ is the sequence that contains all the write operations as in $\sigma$ and in the same order, but no read, lock or unlock operation.

Theorem 5.1.13 Let $\Gamma$ be a constraint theory and $D$ a database state such that $D \models \Gamma$. Given the write transactions $T_1, \dots, T_n$, let $T_i'$ be the simplified locked transaction of $T_i$ with respect to $\Gamma$ and the state $D_i$ reached after the last $Lock$ in $T_i'$, for $1 \le i \le n$. Then any legally executing schedule $\sigma$ over $\{T_1', \dots, T_n'\}$ is conflict serializable and $D^\sigma \models \Gamma$.

Proof Conflict serializability follows from the fact that all simplified locked transactions are 2PL. If $\sigma$ is serial, then the final state is consistent because, by virtue of corollary 5.1.8, each $T_i'$, when executed alone, maps a consistent state to a consistent state, and the initial state is consistent. Let us assume now that $\sigma$ is not serial. Consistency of the final state is guaranteed by the level of isolation of any two transactions $T_i'$ and $T_j'$ in $\sigma$. Suppose that $T_i'$ starts writing before $T_j'$. If any of the writes in $T_i'$ was in $R(\Sigma_j)$ or in $R(\overline{T}_j)$, or, vice versa, if any of the writes in $T_j'$ was in $R(\Sigma_i)$ or in $R(\overline{T}_i)$, then the locks would force $T_j'$ to wait for completion of $T_i'$. In other words, if the behavior of $T_i'$ affects the behavior of $T_j'$ (or vice versa), the scheduler executes the writes of $T_i'$ and $T_j'$ serially. If they do not affect each other, their writes can be interleaved, but their execution does not depend on the other transaction. Therefore, in all cases their execution is the same as that of a serial schedule. $\Box$

The result of theorem 5.1.13 describes a transaction system in which all possible schedules execute correctly. However, the database performance can vary dramatically, depending on the chosen schedule, on the database state and on the waits due to the locks. For this reason, we observe that if the CWPs in use in the simplified locked transactions are minimal according to the resource set criterion, then the amount of locked database resources is also minimized, which increases the database throughput. As we discussed in chapter 3, it is not possible to obtain a minimal simplification in all cases. Anyhow, any good approximation of an ideal simplification procedure (i.e., one that might occasionally contain some redundancy), such as $Simp_{LS}$, will be useful as a component of an architecture for transaction management based on these principles.

5.1.4 Discussion

The proposed approach describes a schedule construction policy that guarantees conflict serializability and correctness, which are essential requirements in any concurrent database system. This approach distinguishes itself from previous ones in the following respects. Firstly, it complies with the semantics of deferred integrity checking (i.e., integrity constraints do not have to hold in intermediate transaction states), as opposed to what transaction transformation techniques for view maintenance typically offer. Secondly, it features an early detection of inconsistency, before the execution of the update and without simulating the updated state, that allows one to avoid executions of illegal transactions and subsequent rollbacks. This requires the introduction of locks, whose amount is, however, minimized and, in the case of well designed transactions, often null. Consider, e.g., a referential integrity constraint and an update that inserts both the referencing and the referenced tuple: no lock and no check are needed with our approach, whereas an (immediate) view maintenance approach would require two checks.

The presented approach can be further refined in a number of ways. For example, we have implicitly assumed exclusive locks so far; however, a higher degree of concurrency is obtained, without affecting theorem 5.1.13, by using shared locks on $R(\Sigma)$ and update locks on the other resources in definition 5.1.11. Besides, to achieve serializability, a so-called predicate lock needs to be used on the condition given by the simplified integrity constraints; as is well known, predicate locks require a high computational effort and, implementation-wise, are approximated as index locks. To this end, performance is highly improved by adding an index for every argument position occupied by an update parameter in a database literal in the simplified integrity constraints. Finally, we note that, although the introduced locks may determine deadlocks, all standard deadlock detection and prevention techniques are still applicable. For instance, an arbitrary deadlocked transaction $T$ can be restarted after unlocking its resources and granting them to the other deadlocked transactions.

5.2 Applications to data integration

Data integration has attracted much attention in recent years due to the explosion in the number of online data sources and the globalization of society and business. To integrate a set of different local data sources means to provide a common database schema, often called a global or mediator schema, and to describe a relationship between the different local schemata and the global one. Two common paradigms for defining such relationships are the so-called local-as-view (LaV, a.k.a. source-centric) and global-as-view (GaV, a.k.a. global-centric); see [164] for definitions and comparison.

93 this. As in a traditional database of nontrivial size, it is not practically feasible to check integrity constraints for the entire database in one operation: an incremental approach is needed so that only a small amount of work is required for each update. For the combined case, it is even more urgent to optimize integrity checking because of the presence of possible transition delays over network links. We generalize the presented simplification technique to combined databases and apply it to two cases: when several sources are integrated and when an updated source notifies the mediator with a message about the update. When integrating a new source, we trust satisfaction of its local constraints. When a message about a local update is received, besides local consistency before the update, we assume that the update has been verified locally. As a natural result, we obtain that only the possible interference between the update and the other sources needs to be checked.

5.2.1 A framework for data integration In the context of data integration, we have in general several databases (the sources and the mediator) and other operations than database updates need to be considered, such as database combination. We shall therefore generalize the notation in order to describe the relevant cases for data integration. In order to provide a unified view of the data residing at different sources, we need to indicate how the global predicates are expressed in terms of the source predicates. We shall therefore introduce the notion of mapping, which we assume to be sound, i.e., the information produced by the views over the sources contains only, but not necessarily all the data associated to the global predicates. Complete mappings and exact mappings are defined in a similar way (see, e.g., [33]), but we do not consider them here. For convenience, we present our formalization for schemata in LS, but the methodology is general and does not rely on this assumption.

Definition 5.2.1 (Mapping) Consider the disjoint schemata S0,...,Sn in LS. A GaV- mapping M : (S1, ..., Sn) ⇒ S0, is an update in LS for a schema S = S0 ∪ · · · ∪ Sn such that the affected predicates are exactly the predicates in pred(S0) and the predicates in the bodies of the defining formulas of predicate updates are in ∪ipred(Si) or are built-ins. We can now extend the After operator to handle source combination operations described by a mapping from the sources to the mediator. In particular, for a mapping M : M M (S1, ..., Sn) ⇒ S0, the notation Afterdi (S0) refers to AfterLS (hIDB 0 ∪ · · · ∪ IDB n, Γ0i), where Si = hIDB i, Γii for 0 ≤ i ≤ n. We use the term operation to refer to either an update or the application of a GaV-mapping. For a database combination, Afterdi moves from the state after the integration to the non-integrated state, i.e., from a theory concerning the mediator to one concerning the sources. Examples of the application of these extended versions of the operators are shown in the next sections.

5.2.2 Integrity constraints under global-as-view In a global-centric approach it is required that the global schema is expressed in terms of the sources. The global predicates must be associated with views over the sources: this

94 is exactly what the notion of GaV-mapping makes precise, as stated in definition 5.2.1. We start our analysis by considering a borderline case of GaV-mapping consisting of 1 the combination of two databases having isomorphic schemata. Let S1, S2, S be three disjoint schemata that are identical up to consistent renaming of predicates and Γ1,Γ2 and Γ the constraint theories defined at the sources and the mediator, respectively. The global database state consists of the union of the local ones and the GaV-mapping M used for the combination is defined as a set containing, for every predicate p in pred(S), the entries: p(X~ ) ⇐ p1(X~ ) ∨ p2(X~ ), where X~ is a sequence of variables and p1, p2 are the predicates corresponding to p in S1, S2, respectively. A simplified test for checking that the combined database is consistent M with Γ (and assuming that Γ holds in an empty database) is then given by SimpΓ1∧Γ2 (Γ), U Σ∪∆ U where Simp∆(Σ) is used in the following as syntactic sugar for OptimizeLS (Afterdi (Σ)). Example 5.2.2 Let us refer to example 3.2.15 on page 36 and consider the schemata S0,S1,S2 with Si = h∅, Γii and

Γi = {← mi(X,Y ) ∧ mi(X,Z) ∧ Y 6= Z} for i ∈ {0, 1, 2}. Suppose that S1,S2 are the schemata of two source databases and S0 is the schema of a mediator defined by the GaV-mapping

M = {m0(X,Y ) ← m1(X,Y ) ∨ m2(X,Y )} Assuming that the source databases are consistent, their combination at the mediator is M consistent with Γ if and only if the condition given by SimpΓ1∧Γ2 (Γ) holds.

M Afterdi (Γ) ≡ { ← (m1(X,Y ) ∨ m2(X,Y ))∧ (m1(X,Z) ∨ m2(X,Z)) ∧ Y 6= Z } ≡ { ← m1(X,Y ) ∧ m1(X,Z) ∧ Y 6= Z, ← m1(X,Y ) ∧ m2(X,Z) ∧ Y 6= Z, ← m2(X,Y ) ∧ m1(X,Z) ∧ Y 6= Z, ← m2(X,Y ) ∧ m2(X,Z) ∧ Y 6= Z } The only check that is needed is, as expected, a cross-check between the two databases, as the other denials are removed by Optimize:

$Simp^M_{\Gamma_1 \wedge \Gamma_2}(\Gamma) = \{\ \leftarrow m_1(X, Y) \wedge m_2(X, Z) \wedge Y \neq Z\ \}.$

We can get even better results when extra knowledge concerning the combination of the sources is available. This simply amounts to adding the extra knowledge to the conditions in the subscript of Simp.

Example 5.2.3 Consider example 5.2.2, where we now have the knowledge Γ1,2 that the data concerning the husbands in the two databases are disjoint:

Γ1,2 = {← m1(X,Y ) ∧ m2(X,Z)}.

1The case with more than two sources is similar.

95 A much stronger simplification is obtained now, as

M SimpΓ1∧Γ2∧Γ1,2 (Γ) = ∅.

The cross-check that was found in example 5.2.2 is subsumed by Γ1,2 and thus discarded, so no check is needed, as the combined state will anyhow be consistent. When the mapping is arbitrary, the method can be applied in a similar way.

Example 5.2.4 We consider a data integration problem inspired by [117] but here extended with global and local integrity constraints. Suppose we have two sources containing information about movies. We use the variables $I$, $T$, $Y$ and $R$ for movie identifiers, titles, years and reviews respectively. The first source contains movies $m(I, T, Y)$, where $I$ is key, whereas the second contains reviews $r(I, R)$. Furthermore, we know that the identifiers in $r$ are a subset of the identifiers in $m$. The mediator assembles this information in a relation $f(I, T, R)$ (film), as defined by the following GaV-mapping:

M = {f(I,T,R) ← m(I,T,Y ) ∧ r(I,R)}.

The following conditions are therefore known to hold on the ensemble of sources:

Γ1 = { ← m(I,T1,Y1) ∧ m(I,T2,Y2) ∧ T1 6= T2, ← m(I,T1,Y1) ∧ m(I,T2,Y2) ∧ Y1 6= Y2 }, Γ1,2 = { ← r(I,R) ∧ ¬m(I,T,Y ) }.

Let Γ express the fact that I is a primary key for f:

Γ = { ← f(I,T1,R1) ∧ f(I,T2,R2) ∧ T1 6= T2, ← f(I,T1,R1) ∧ f(I,T2,R2) ∧ R1 6= R2 }.

M In order to check whether Γ holds globally, we calculate Afterdi (Γ) and obtain

{ ← m(I,T1,Y1) ∧ r(I,R1) ∧ m(I,T2,Y2) ∧ r(I,R2) ∧ T1 6= T2, ← m(I,T1,Y1) ∧ r(I,R1) ∧ m(I,T2,Y2) ∧ r(I,R2) ∧ R1 6= R2 }.

The first constraint is obviously subsumed by the first constraint in Γ1 and the second one can be simplified with Γ1,2, which gives the following:

$Simp^M_{\Gamma_1 \wedge \Gamma_{1,2}}(\Gamma) = \{\leftarrow r(I, R_1) \wedge r(I, R_2) \wedge R_1 \neq R_2\}.$

This evidently corresponds to the requirement that $I$ must be a key for $r$ as well.
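Executed at the second source, the simplified constraint is just a key check on r; a minimal sketch follows (the relation encoding as a set of (identifier, review) pairs and the function name are assumptions).

    def violates_key_check(r):
        # <- r(I, R1) and r(I, R2) and R1 != R2: the denial body is satisfiable,
        # i.e. integrity is violated, iff some identifier has two distinct reviews.
        seen = {}
        for i, review in r:
            if i in seen and seen[i] != review:
                return True
            seen.setdefault(i, review)
        return False

    print(violates_key_check({(1, "good"), (1, "poor")}))   # True
    print(violates_key_check({(1, "good"), (2, "bad")}))    # False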

5.2.3 Integrity constraints under local-as-view

A LaV-mapping is usually understood as a set of views of source predicates over global predicates. From now on, with the word LaV-view we will refer to a formula of the form $A \Rightarrow B$, where $A$ is an atom (the antecedent) referring to a predicate at a source and $B$ a conjunction of atoms (the consequents) referring to predicates at the mediator. A LaV-view $A \Rightarrow B$ is safe whenever all the variables in $B$ appear in $A$ as well. A set of safe LaV-views is called a safe LaV-mapping. Given a safe LaV-mapping, and assuming it is sound as discussed in section 5.2.1, we can always rewrite it as an equivalent GaV-mapping. This is shown in the following examples and can be done by using the fact that, with safeness, $A \Rightarrow A_1 \wedge \dots \wedge A_n$ is the same as $A_1 \Leftarrow A$ and ... and $A_n \Leftarrow A$, where $A, A_1, \dots, A_n$ are atoms, and perhaps by adding some equalities in the bodies in order to have only distinct variables in the heads. Note that, without assuming safeness, a LaV-mapping cannot always be rewritten as a GaV-mapping in the sense of definition 5.2.1, in that existentially quantified variables might occur. The treatment of such cases may be handled with skolemization, but, for compatibility with the simplification framework, we will only focus on safe LaV-mappings; the core of the rewriting is sketched below.
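The rewriting just mentioned can be sketched as follows; the encoding of atoms as (predicate, arguments) pairs is an assumption, and the normalization described above (introducing equalities so that heads contain only distinct variables, and grouping the rules for each global predicate into a single disjunctive defining formula) is deliberately omitted.

    def lav_to_gav(lav_views):
        # A safe LaV-view A => A1 and ... and An yields the GaV rules
        # A1 <- A, ..., An <- A (safeness guarantees that every variable
        # in a consequent already occurs in the antecedent).
        rules = []
        for antecedent, consequents in lav_views:
            for consequent in consequents:
                rules.append((consequent, antecedent))   # (head, body)
        return rules

    # L as in example 5.2.5 below: m1(X,Y) => m(X,Y) and n(X,it),
    #                              m2(X,Y) => m(X,Y) and n(X,dk).
    L = [(("m1", ("X", "Y")), [("m", ("X", "Y")), ("n", ("X", "it"))]),
         (("m2", ("X", "Y")), [("m", ("X", "Y")), ("n", ("X", "dk"))])]
    for head, body in lav_to_gav(L):
        print(head, "<-", body)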

Example 5.2.5 Reconsider the scenario discussed in example 5.2.2. The global database combines now two sources where in the first one the husbands are Italian, and in the second one Danish, which is expressed by the LaV-mapping

L = { m1(X,Y ) ⇒ m(X,Y ) ∧ n(X,it), m2(X,Y ) ⇒ m(X,Y ) ∧ n(X, dk) }, where m and n are global predicates. An equivalent GaV-mapping is as follows:

ML = { m(X,Y ) ← m1(X,Y ) ∨ m2(X,Y ), n(X,Z) ← (m1(X,Y ) ∧ Z = it) ∨ (m2(X,Y ) ∧ Z = dk) }.

Consider a global constraint imposing uniqueness of nationality:

φ = ← n(X,Y ) ∧ n(X,Z) ∧ Y 6= Z.

Then, given the local assumptions Γ1 and Γ2, checking φ corresponds to verifying dis- jointness of m1 and m2:

ML SimpΓ1∧Γ2 ({φ}) = ← m1(X,Y ) ∧ m2(X,Z). Example 5.2.6 This example is also inspired by [117] and extended with global and local integrity constraints. The global database integrates three different sources that provide information about movies. We use the variables T , Y , D and G for movie titles, years, directors, and genres respectively. The global predicates are m(T,Y,D,G), representing a given movie, and d(D), i(D), a(D), . . ., representing nationalities of directors, here Danish, Italian, American, etc. The following integrity constraints are assumed: a key constraint on m (T,Y is key), a domain constraint on film genre, and uniqueness of nationality. Underlines are used as anonymous variables `ala Prolog for ease of notation.

Γ = { ← m(T,Y,D1, ) ∧ m(T,Y,D2, ) ∧ D1 6= D2, ← m(T,Y, , G1) ∧ m(T,Y, , G2) ∧ G1 6= G2, ← m( , , , G) ∧ G 6= comedy ∧ G 6= drama ∧ · · · , ← d(D) ∧ i(D), ← d(D) ∧ a(D), ···}.

97 There are three source databases. The first one contains American comedies given as m1(T,Y,D) with T,Y as key and a LaV-mapping as follows.

Γ1 = {← m1(T,Y,D1) ∧ m1(T,Y,D2) ∧ D1 6= D2} L1 = {m1(T,Y,D) ⇒ m(T,Y,D, comedy) ∧ a(D)}.

The second source contains Danish movies only, with a key constraint Γ2 on T,Y as usual and the following LaV-view.

L2 = {m2(T,Y,D,G) ⇒ m(T,Y,D,G) ∧ d(D)}.

The third source is a general list of movies with predicates (m3, d3, i3, a3 etc.) and in- tegrity constraints Γ3 identical to the global ones modulo consistent renaming of predicates. The LaV-views are specified as follows.

L3 = { m3(T,Y,D,G) ⇒ m(T,Y,D,G), d3(D) ⇒ d(D), ···}

2 Since L1 ∪ L2 ∪ L3 is safe we can rewrite it as the following GaV-mapping.

M = { m(T,Y,D,G) ← m1(T,Y,D) ∧ G = comedy, m(T,Y,D,G) ← m2(T,Y,D,G), m(T,Y,D,G) ← m3(T,Y,D,G), a(D) ← m1( , , D), d(D) ← m2( , , D, ), a(D) ← a3(D), d(D) ← d3(D), i(D) ← i3(D), ···} The simplified integrity constraints for the integration of the three databases, given as M Σ = SimpΓ1∧Γ2∧Γ3 (Γ), are, as expected, simplified rules covering possible conflicts in cross combinations only, as local consistency is assumed.

Σ = { ← m1( , , D) ∧ m2( , , D, ), ← m1( , , D) ∧ p3(D), (p any nationality pred. different from a) ← m2( , , D, ) ∧ p3(D), (p any nationality pred. different from d) ← m1(T,Y, ) ∧ m2(T,Y, , ), ← m1(T,Y,D1) ∧ m3(T,Y,D2, ) ∧ D1 6= D2, ← m2(T,Y,D1, ) ∧ m3(T,Y,D2, ) ∧ D1 6= D2, ← m1(T,Y, ) ∧ m3(T,Y, , G) ∧ G 6= comedy, ← m2(T,Y, , G1) ∧ m3(T,Y, , G2) ∧ G1 6= G2, ← m2( , , , G) ∧ G 6= comedy ∧ G 6= drama ∧···}. The example can be modified in several ways, e.g., by assuming that the third source is an unchecked database to which enthusiastic amateurs can add arbitrary information. In that case, the simplified constraints would also include a copy of the full set of global constraints with predicate names m3, a3, etc. 2 If, say, in m1 the genre was left unspecified, skolemization would be needed.

5.2.4 Absorption of local updates

A data integration system needs to be able to adjust itself dynamically as sources are updated over time. We assume that source databases maintain their own consistency and that reports are available at the global level about which updates have been performed; this may be supplied, e.g., by a process monitoring the sources. Consistency then needs to be checked globally. This problem, which we refer to as absorption of local updates, can be handled as follows. Assume a GaV-mapping $M$, a set of conditions $\Delta$ that hold at the sources and a constraint theory $\Gamma$ that is maintained at the mediator; this means that the condition $\Sigma = Simp^M_\Delta(\Gamma)$ is known to hold. Suppose that a source database can receive an update $U$ of the form (3.2), as defined on page 45. Since $U$ may affect $\Sigma$ and maintenance of $\Sigma$ can be performed only after execution of $U$, we need a post-test of $\Sigma$ with respect to $U$.

For such updates, we conjectured that $Simp_{LS}$ can also be used as a post-test. Therefore, if the claim of conjecture 3.2.41 (on page 45) holds (which it does in all cases presented in this section), $\Gamma$ is then still satisfied at the mediator if and only if $\Phi = Simp^U_{LS}(\Sigma)$ holds. If the claim does not hold, one can always derive a post-test of $\Sigma$ wrt. $U$ from $Simp^U_{LS}(\Sigma)$ in the following way. Let us indicate as $\neg U$ the set that contains the same atoms as $U$ but with the negation signs interchanged. We say that a database $D$ is revertible with respect to $U$ if $D = (D^U)^{\neg U}$, i.e., if $D$ does not already contain any addition in $U$ and $D$ contains all deletions in $U$. We can assume here that an update $U$ is only executed on a database that is revertible wrt. $U$, since the source will actually know which single updates were performed and which ones were not. Then a post-test of $\Sigma$ wrt. $U$ is given by $\Psi = Optimize^{\emptyset}_{LS}(After^{\neg U}_{LS}(\Phi))$. To see this, consider that $D \models \Psi$ iff $D^{\neg U} \models \Phi$ for any $D$, and, thus, $D^U \models \Psi$ iff $(D^U)^{\neg U} \models \Phi$ for any $D$. Therefore, $D^U \models \Psi$ iff $D \models \Phi$ for any $D$ revertible wrt. $U$. Besides, since $\Phi$ is a CWP of $\Sigma$ wrt. $U$, $D \models \Phi$ iff $D^U \models \Sigma$ for any $D$ consistent with $\Sigma$. Finally, by transitivity, $D^U \models \Psi$ iff $D^U \models \Sigma$ for any $D$ consistent with $\Sigma$ and revertible wrt. $U$.

Example 5.2.7 In example 5.2.2, the following integrity constraint was generated for the integration of two sources referred to by predicates $m_1$ and $m_2$:

$\Sigma = \ \leftarrow m_1(X, Y) \wedge m_2(X, Z) \wedge Y \neq Z.$

If the update $U = \{m_1(a, b)\}$ is reported, the optimal way to check the global consistency is to test $Simp^U_{LS}(\Sigma) = \ \leftarrow m_2(a, Z) \wedge b \neq Z$ at the updated sources.

Example 5.2.8 Consider $\Sigma$ from the LaV integrated movie database of example 5.2.6 and assume that the second source reports the addition of a new movie

$U = \{m_2(dogville, 2003, vonTrier, drama)\}.$

The following tests, calculated as $Simp^U_{LS}(\Sigma)$, remain:

$\{\ \leftarrow m_1(\_, \_, vonTrier),$
$\ \leftarrow p_3(vonTrier),$ ($p$ any nationality pred. different from $d$)
$\ \leftarrow m_1(dogville, 2003, \_),$
$\ \leftarrow m_3(dogville, 2003, d, \_) \wedge d \neq vonTrier,$
$\ \leftarrow m_3(dogville, 2003, \_, g) \wedge g \neq drama\ \}.$

5.2.5 Related work

In [95] the problem of answering queries using views (query folding) under LaV is addressed with a technique based on resolution, and several cases, including integrity constraints, negation and recursion, are dealt with.

A short survey on the role of integrity constraints in data integration is given in [35]. They are regarded as means to extract more information from incomplete sources as well as components that raise the issue of dealing with possibly inconsistent global databases. Several GaV typologies are studied that include the treatment of key and foreign key constraints and both sound and exact mappings. In [33] the same authors develop these ideas to show that, in the presence of integrity constraints, query answering in GaV becomes as difficult as in LaV, as the problem of incomplete information implicitly arises. Further discussion on the expressive power of the two approaches, in terms of query-preserving transformations, is given in [34].

In [164], Ullman relates the problem of constructing answers to queries using views to query containment algorithms and compares two implemented systems in these terms. Levy [117] applies techniques from artificial intelligence to the problem of data integration and shows examples of both GaV and LaV query reformulation with integrity constraints, including particular data access patterns. Li [119] considers the use of integrity constraints in LaV query processing and optimization and distinguishes between local and global constraints and, for these, between general global constraints and source-derived global constraints.

Others, e.g., [16, 17, 120], have approached the global consistency problem by introducing disjunctive databases. In addition to data-related inconsistencies, the authors of [120] also consider the problem of merging sources whose compatibility is affected by the presence of synonyms, homonyms or type conflicts in the schemata.

Another direction of research concerns the automatic identification of a common global schema with mappings from different source schemata. Various techniques, such as linguistic and ontological similarity between relation and attribute names and type structures, can be used; see [144] for an overview. Paraconsistent logics can be used to model the possible inconsistencies coming from database integration and update. In [68] it is shown how inconsistent information can be stored and managed in this way.

5.3 Simplified integrity checking for XML

Consistency requirements for XML data are as important as those holding for relational data, and constraint definition and enforcement are expected to become fundamental aspects of XML data management. However, expressing, verifying and automatically enforcing integrity in the XML context requires additional effort with respect to the relational setting.

A first, evident reason is that there is no standard means of specifying generic constraints over XML document collections. In current XML specifications, fixed-format structural integrity constraints can already be defined by using XML Schema definitions; they are concerned with type definitions, occurrence cardinalities, unique constraints, and referential integrity. However, a generic constraint definition language for XML, with expressive power comparable to assertions and checks of SQL, is still not present in the XML Schema specification [167]. We deem this a crucial issue, as this lack of expressiveness does not allow one to specify business rules to be directly included in the schema.

Secondly, new difficulties are inherent in XML's hierarchical data model. The infeasibility of a brute force approach to integrity checking, i.e., verifying the whole database each time data are updated, is even more evident when the underlying data management technology is as young and unripe as in the case of XML.

A suitable formalism for the declarative specification of integrity constraints over XML data is therefore required in order to apply simplification techniques. More specifically, we adopt for this purpose a formalism called XPathLog, a logical language inspired by datalog and XPath and introduced in [129]. In our approach, the tree structure of XPathLog constraints is mapped to a relational representation. Then, simplification takes place as usual, and, finally, the simplified constraints are translated into XQuery expressions that can be matched against the XML document so as to check that the update does not introduce any constraint violation.

In principle it would be possible to apply simplification techniques directly to XPathLog constraints, by appropriately extending the Simp operator. However, XML data is most often represented and stored with an underlying relational format. In such cases, the relational representation is already available, and the updates are eventually reverted into modifications of relational tuples. Still, as native XML repositories become more popular and reliable, it is desirable to express simplified constraints in XQuery. To this end, we have introduced the above mentioned final translation step from simplified datalog constraints to XQuery. An experimental evaluation of the method is provided in the next chapter, in section 6.3.

We assume, throughout this section, a basic knowledge of standard XML notions, such as element and path, and the corresponding syntax [2].

5.3.1 General constraints over semi-structured data

In [129], the XPath language is extended with variable bindings and embedded into first-order logic to form XPath-Logic; XPathLog is then the Horn fragment of XPath-Logic. Thanks to its logic-based nature, XPathLog is well-suited for querying XML data and providing declarative specifications of integrity constraints. It uses an edge-labeled graph model in which subelements are ordered and attributes are unordered. Path expressions have the form root/axisStep/.../axisStep, where root specifies the starting point of the expressions (such as the root of a document or a variable bound to a node) and every axisStep has the form axis::nodetest[qualifier]*. An axis defines a navigation direction in the XML tree: child, attribute, parent, ancestor, descendant, preceding-sibling and following-sibling. All elements satisfying nodetest along the chosen axis are selected, then the qualifier(s) are applied to the selection to further filter it. Axes are abbreviated as usual, e.g., path/nodetest stands for path/child::nodetest and path/@nodetest for path/attribute::nodetest.

XPath-Logic formulas are built as follows. An infinite set of variables is assumed along with a signature of element names, attribute names, function names, constant symbols and predicate names. A reference expression is a path expression that may be extended to bind selected nodes to variables with the construct "→ Var". XPath-Logic predicates are predicates over reference expressions, and atoms and literals are defined as usual. Formulas are thus obtained by combining atoms with connectives (∧, ¬, and sometimes ∨ as syntactic sugar) and variables are assumed to be universally quantified. Clauses are written in the form Head ← Body, where the head, if present, is an atom and the body a conjunction of literals. In particular, a denial is a headless clause; integrity constraints will be written as denials, which indicates that there must be no variable binding satisfying the condition in the denial body for the data to be consistent.

Example 5.3.1 Consider the following two documents: pub.xml containing a collection of published articles and rev.xml containing information on reviewer/paper assignment for all tracks of a given conference. We expect the former to be a huge collection of articles, such as DBLP [118], whereas the latter is a smaller document maintained locally by a conference administrator. The DTDs are as follows.

Consider the following integrity constraint, which imposes the absence of conflict of interests in the submission review process (i.e., no one can review papers written by a former coauthor or by him/herself):

← //rev[name/text()→R]/sub/auts/name/text()→A ∧ (A = R ∨ //pub[aut/name/text()→A ∧ aut/name/text()→R])

The text() function refers to the text content of the enclosing element. The condition in the body of this constraint indicates that there is a reviewer named R who is assigned a submission whose author has name A and, in turn, either A and R are the same or two authors of a same publication have names A and R, respectively3. Integrity is kept when this condition does not hold.

This example shows that typical business rules can be expressed naturally with XPathLog. It would not have been possible to express the same constraint with currently available specification paradigms, such as XML Schema or DTDs; the same task would have been much more difficult using any procedural approach.

3Note that, for simplicity, we have assumed here that a reviewer’s name is unique; with a more realistic design, we could have added, say, a SSN to each reviewer and expressed the constraint on the SSN instead of the name. Also note that the disjunct A = R is not redundant w.r.t. the other disjunct (in which A 6= R is not stated), because it catches the (yet unlikely) case of a reviewer without any publication.

5.3.2 Mapping to the relational data model

In order to apply our simplification framework to XML constraints, the update patterns and the constraints themselves need to be mapped from the XML domain to the relational model.

Mapping of the schema

There exist several approaches to the problem of representing semi-structured data in relations [157, 25, 74, 83]. A survey can be found in [112]. Our approach is targeted to deductive databases: each node type, as defined in the DTD, is mapped to a corresponding predicate; in some cases further optimization is possible. The shape of the predicates is fixed for the first three positions, while the remaining positions depend on the schema. More precisely, the first three attributes of all predicates respectively represent, for each XML item:

• Its (unique) node identifier.

• Its position among the children of its parent node.

• The node identifier of its parent node.

It is worth noting that the second attribute is crucial, as the XML data model is ordered, while the relational data model is unordered. Therefore, the positions within the document must be explicitly represented for each XML item in order to preserve structural constraints. Whenever a parent-child relationship within a DTD is a one-to-one correspondence (or an optional inclusion), a more compact form is possible, because a new predicate for the child node is not necessary: the attributes of the child may be equivalently represented within the predicate that corresponds to the parent (possibly allowing null values in case of optional child nodes). According to these mapping criteria, the two documents of example 5.3.1 map to the following relational schema,

pub(Id, Pos, IdParent_dblp, Title)
aut(Id, Pos, IdParent_pub, Name)
track(Id, Pos, IdParent_review, Name)                                        (5.1)
rev(Id, Pos, IdParent_track, Name)
sub(Id, Pos, IdParent_rev, Title)
auts(Id, Pos, IdParent_sub, Name)

where Id, Pos and IdParent_tagname preserve the hierarchy of the documents and where the PCDATA content of the name and title node types is systematically embedded into the container nodes, so as to reduce the number of predicates. As already mentioned, mapping a hierarchical ordered structure to a flat unordered data model forces the exposition of information that is typically hidden within XML repositories, such as the order of the sub-nodes of a given node and unique node identifiers⁴. Such identifiers and order information are necessary in order to reconstruct the hierarchy by means of join conditions over the parent-child containment relationship. The root nodes of the documents (dblp and review) are not represented as predicates, as they have no local attributes but only subelements; however, such nodes are referenced in the database as values for the IdParent_dblp and IdParent_review attributes respectively, within the representation of their child nodes. Publications map to the pub predicate, authors in pub.xml map to aut, while authors in rev.xml map to auts, and so on, with predicates corresponding to tagnames. Last, names and titles map to attributes within the predicates corresponding to their containers. This representation is convenient for our simplification purposes, in that it embeds containment relationships in different predicates depending on the element types. If we had chosen to adopt a more weakly structured representation using, say, a relation cont(Parent, Child) to indicate parent-child containment independently of the element type, a potentially high number of instances of the cont relation would be required to represent non-trivial XML constraints and updates. This would result in explosive behavior and intractability of the simplification process.
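As a small illustration of this mapping (the data and the node identifiers p1, a1 and d0 are hypothetical), a pub.xml fragment consisting of a single publication entitled “Duckburg tales” written by an author named Mickey would be represented by the two facts pub(p1, 1, d0, “Duckburg tales”) and aut(a1, 2, p1, “Mickey”): d0 is the identifier of the dblp root, the pub element is its first child, and the aut element is the second child of the pub element (the title, although embedded into the pub predicate, still occupies position 1).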

Mapping of update statements

In order to show the mapping criteria for XML updates, we refer to the XUpdate language [113], but any other formalism that allows the specification of insertions of data fragments would apply. Consider the following XUpdate statement:

<xupdate:modifications version="1.0" xmlns:xupdate="http://www.xmldb.org/xupdate">
  <xupdate:insert-after select="..."> <!-- select expression omitted -->
    <xupdate:element name="sub">
      <title>Taming Web Services</title>
      <auts><name>Jack</name></auts>
    </xupdate:element>
  </xupdate:insert-after>
</xupdate:modifications>

In the relational model, this update statement corresponds to adding to the database the following set of facts: { sub(ids, 7, idr, “Taming Web Services”), auts(ida, 2, ids, “Jack”) } where ida and ids represent the identifiers that are to be associated to the new nodes and idr is the identifier associated to the target rev element. Their value is immaterial to the semantics of the update, provided that a mechanism to impose their uniqueness is available. In other words, it is guaranteed that the insertion of a new node never overwrites an existing node, as a new reserved identifier is supposed to be always available, and uniqueness of identifiers is guaranteed and maintained by construction.

⁴Identifiers are unique and characterize the nodes from a structural viewpoint, but they are not logical keys for the nodes they are attached to. Moreover, such identifiers are typically opaque to the XPath/XQuery programmer, and we will refer to them as immaterial values without further investigation, as we can assume that they are implicitly available whenever a document collection is stored into an XML repository. As an example, consider that the same reviewer can occur more than once in the document for reviewing different papers; such multiple entries occur of course with the same name (which is the logical key in the running example) but with different node identifiers.

On the other hand, the actual value of idr depends on the dataset and needs to be retrieved by interpreting the select clause of the XUpdate statement. Namely, idr is the identifier for the fifth (reviewer) child of the second (track) node, in turn contained in the root (review) node of the document rev.xml. The positions (7 and 2 in the second argument of both predicates) are also derived by parsing the update statement: 7 is determined as the successor of 6, according to the insert-after semantics of the update; 2 is due to the ordering, since the auts element comes after the title element. For simplicity we have shown an example in which node identifiers were easily derived from the candidate location of the inserted fragment. Should the location specification be more complex, this would result in a heavier pre-processing phase so as to identify the positions at which the update is to occur. The generality of the approach, however, would not be affected. Finally, note that the same value ids occurs both as the first argument of sub() and the third argument of auts(), since the latter represents a subelement of the former.

Mapping of integrity constraints

The last step in the mapping from XML to the framework of deductive databases is to compile XPathLog denials into corresponding datalog denials. The input to this phase consists of the two schemata (the XML schema and its relational counterpart, generated according to the principles stated above) and an XPathLog denial. All path expressions in an XPathLog statement generate a chain of conditions over the predicates corresponding to the node types traversed by the path expression. Containment in terms of the parent-child relationship translates to correspondences between variables in the first position of the container and in the third position of the contained item. Quite straightforwardly, the XPathLog denial

← //pub[title = “Duckburg tales”]/aut/name → N ∧ N = “Goofy”

which expresses the fact that the author of “Duckburg tales” cannot be Goofy, maps into

← pub(Ip, _, _, “Duckburg tales”) ∧ aut(_, _, Ip, N) ∧ N = “Goofy”.

In general, all variables used in the XPathLog denial may also occur in the datalog denial. In the above example, however, variable N is not strictly necessary to formulate the constraint (it would disappear, e.g., after a reduction step). Longer path expressions would map to longer chains of predicates. Difficulties may arise if the path expressions contain wildcards or the ‘//’ step. In both cases the predicate chain may not be uniquely determined; a schema like (5.1) allows the inference of the possible steps in the chain; alternative chains map to disjunctions within denials (i.e., different denials). If the schema allowed circular references between node types, the mapping would then require the introduction of recursive views that express the transitive closure of the containment relationship between all possible combinations of compatible node types. We focus here on acyclic schemata only. All other XPathLog predicates are transcribed in the datalog translation, as we assume that the set of built-in predicates is the same.

References to the position() function or position filters in the XPathLog denial are handled by associating a variable with the second argument in the relational predicate and then matching it against the corresponding expression.
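For instance, a hypothetical filter selecting the first pub child of the dblp root would be rendered by using a variable P in the second position of the corresponding atom, pub(Ip, P, _, _), and adding the condition P = 1 to the denial.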

Example 5.3.2 The XPathLog constraint of example 5.3.1, here repeated for the reader’s convenience,

← //rev[name/text() → R]/sub/auts/name/text() → A
  ∧ (A = R ∨ //pub[aut/name/text() → A ∧ aut/name/text() → R])

is translated into the following set of datalog denials.

Γ = { ← rev(Ir, _, _, R) ∧ sub(Is, _, Ir, _) ∧ auts(_, _, Is, R),
      ← rev(Ir, _, _, R) ∧ sub(Is, _, Ir, _) ∧ auts(_, _, Is, A) ∧ aut(_, _, Ip, R) ∧ aut(_, _, Ip, A) }

5.3.3 Simplification of XML constraints

The mappings described in the previous section allow us to express XML schemata, updates and constraints in the relational model. Therefore, we can apply the simplification procedure shown in the previous chapters to simplify the relational counterparts of constraints and updates.

Example 5.3.3 Let us consider constraint Γ from example 5.3.2. An update of interest is, e.g., the insertion of a new submission to the attention of a reviewer, as considered in section 5.3.2. For instance, a submission with a single author complies with the pattern U = {sub(is, ps, ir, t),auts(ia, pa, is, n)}, where the parameter referring to the submission id (is) is the same in both added tuples. The fact that is and ia are fresh new node identifiers can be expressed as a set of extra hypotheses to be exploited in the constraint simplification process:

∆ = { ← sub(is, _, _, _), ← auts(_, _, is, _), ← auts(ia, _, _, _) }.

Uniqueness of each such parameter can be imposed both as an element identifier and as a parent identifier, for all element types. Here, ∆ only includes such uniqueness conditions for the relevant element types: as an identifier for all elements of the same type, and as a parent identifier for all children of that element (note that auts has no children in the relational model). The simplified integrity check with respect to update U and constraint theory Γ is given by Simp^U_∆(Γ):

{ ← rev(ir, _, _, n), ← rev(ir, _, _, R) ∧ aut(_, _, Ip, n) ∧ aut(_, _, Ip, R) }.

The first denial requires that the added author of the submission (n) is not the same person as the assigned reviewer (ir). The second denial imposes that the assigned reviewer is not a coauthor of the added author n. These conditions are clearly much cheaper to evaluate than the original constraints Γ, as they are instantiated to specific values and involve fewer relations. Note that, without the first hypothesis in ∆, one would also need to check that the reviewer is not (already) an author of submission is, nor a coauthor of an (existing) author of is. Without the second hypothesis in ∆, one should also check that n is not a reviewer to which is has already been assigned nor the coauthor of such a reviewer. The last hypothesis in ∆ is not used in this case.

5.3.4 Translation into XQuery

The obtained simplified constraints must be checked before the corresponding update statement, so as to prevent the execution of statements that would violate integrity. Under the hypothesis that the dataset is stored in an XML repository capable of executing XQuery statements, the simplified constraints need to be translated into suitable equivalent XQuery expressions in order to be checked. This section discusses the translation of datalog denials into XQuery. We exemplify the translation process using the (non-simplified) set of constraints Γ defined in example 5.3.2. For brevity, we only show the translation of the second denial. We first expand the atoms of the datalog denial, except for the first and the third positions, which refer to element and parent identifiers and thus keep information on the parent-child relationships of the XML nodes. Here the expansion is:

← rev(Ir, B, C, R) ∧ sub(Is, D, Ir, E) ∧ auts(F, G, Is, A) ∧ aut(H, I, Ip, J) ∧ aut(K, L, Ip, M) ∧ J = R ∧ M = A

The atoms in the denial must be sorted so that, if a variable referring to the parent of a node also occurs as the identifier of another node, then the occurrence as an identifier comes first. Here, no such rearrangement is needed. Then, for each atom p(Id, Pos, Par, D1, ..., Dn), where D1, ..., Dn are the values of tags d1, ..., dn, respectively, we do as follows. If the definition of $Par has not yet been created, then we generate the following XQuery variable definitions

$Id in //p, $Par in $Id/..

Otherwise we just generate

$Id in $Par/p

This is followed by

$Pos in $Id/position(), $D1 in $Id/d1/text(), ..., $Dn in $Id/dn/text()

After the generation of all variable definitions, we build an XQuery boolean expression (returning true in case of violation) by prefixing the definitions with the some keyword and by suffixing them with the satisfies keyword followed by all the remaining conditions in the denial separated by ands. This is a well-formed XQuery expression. Here we have:

some $Ir in //rev, $C in $Ir/.., $B in $Ir/position(), $R in $Ir/name/text(),
     $Is in $Ir/sub, $D in $Is/position(), $E in $Is/title/text(),
     $F in $Is/auts, $G in $F/position(), $A in $F/name/text(),
     $H in //aut, $Ip in $H/.., $I in $H/position(), $J in $H/name/text(),
     $K in $Ip/aut, $L in $K/position(), $M in $K/name/text()
satisfies $J = $R and $M = $A

Such an expression can be optimized by eliminating definitions of variables which are never used, unless they refer to node identifiers. Such variables are to be retained because they express an existential condition on the element they are bound to. Variables referring to the position of an element are to be retained only if used in other parts of the denial. In the example, we can therefore eliminate the definitions of variables $B, $C, $D, $E, $G, $I, $L. If a variable is used only once outside its definition, its occurrence is replaced with its definition. Here, e.g., the definition of $Is is removed and $Is is replaced by $Ir/sub in the definition of $F, obtaining $F in $Ir/sub/auts. Variables occurring in the satisfies part are replaced by their definition. Here we obtain the following query.

some $Ir in //rev, $H in //aut
satisfies $H/name/text() = $Ir/name/text() and
          $H/../aut/name/text() = $Ir/sub/auts/name/text()

The translation of the simplified version Simp^U_∆(Γ) is made along the same lines. Again, we only consider the simplified version of the second constraint (the denial ← rev(ir, _, _, R) ∧ aut(_, _, Ip, n) ∧ aut(_, _, Ip, R)). Now, parameters can occur in the simplified denial. If a parameter occurs in the first or third position of an atom, it must be replaced by a suitable representation of the element it refers to. Here we obtain:

some $D in //aut
satisfies $D/name/text() = %n and
          $D/../aut/name/text() = /review/track[%i]/rev[%j]/name/text()

where /review/track[%i]/rev[%j] conveniently represents ir, as was done for the update statement of section 5.3.2. Similarly, %n corresponds to n. The placeholders %i, %j and %n will be known at update time and replaced in the query.
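The per-atom generation of variable definitions described above can be sketched in Prolog; the term representation and the predicate name defs_for_atom/3 are ours (not those of the implementation [126]), and the generated definitions are kept as terms rather than as XQuery text:

:- use_module(library(lists)).   % member/2, memberchk/2, append/3 (SICStus)

% defs_for_atom(+Atom, +Defined, -Defs)
% Atom = atom(P, Id, Pos, Par, Tags): an expanded atom p(Id, Pos, Par, D1, ..., Dn),
%   with Tags a list of Tag-Variable pairs (e.g., [title-'E']).
% Defined: the variables for which a definition has already been generated.
% Defs: the definitions for this atom; def('Is', path('Ir', sub)) stands for "$Is in $Ir/sub".
defs_for_atom(atom(P, Id, Pos, Par, Tags), Defined, Defs) :-
    (   memberchk(Par, Defined)
    ->  Root = [def(Id, path(Par, P))]                        % $Id in $Par/p
    ;   Root = [def(Id, anywhere(P)), def(Par, parent(Id))]   % $Id in //p, $Par in $Id/..
    ),
    findall(def(V, text(Id, Tag)), member(Tag-V, Tags), TagDefs),
    append(Root, [def(Pos, position(Id)) | TagDefs], Defs).

For the atom sub(Is, D, Ir, E) of the expanded denial, with Ir already defined, the call defs_for_atom(atom(sub, 'Is', 'D', 'Ir', [title-'E']), ['Ir'], Defs) yields terms corresponding to $Is in $Ir/sub, $D in $Is/position(), $E in $Is/title/text().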

5.3.5 Related work

An attempt to adapt view maintenance techniques to the semi-structured data model has been made in [159] and, recently, in [150], which however only considers additions and deletions of leaf nodes and only allows predicate tests that refer to nodes in the subtree of the node being tested. Several recent papers address the problem of incrementally validating XML documents with respect to DTD or XML Schema definitions. In particular, [141] provides an algorithm for validation upon insertions and deletions of leaf nodes, whereas [18] also considers insertions of subtrees. The complexity of validating an XML document with respect to a DTD, including structural as well as ID/IDREF constraints, has been shown to be low [18]: in O(n log n) time and linear space. Incremental validation with respect to document updates has been studied for specific classes of DTDs (1,2-CF DTDs) and shown [18] to be substantially simpler than complete revalidation, with a logarithmic time complexity. Validation of schema, key and foreign key constraints is also considered in [4], where the validator is represented by a bottom-up tree transducer; as in our approach, an update is performed only if it is first accepted by the validator. An attempt at the simplification of general integrity constraints for XML has been made in [21], where, however, constraints are specified in a procedural fashion with an extension of XML Schema that includes loops with embedded forbidden assertions.

We are not aware of other works that address the validation problem with respect to general constraints for XML. However, integrity constraint simplification can be reduced to query containment as long as the constraints can be viewed as queries. To this end, relevant works on containment for various fragments of XPath that would foster a direct simplification approach for XML are [151, 136]. There are several proposals and studies of constraint specification languages for XML by now. In [79], a unified constraint model (UCM) for XML is proposed that captures in a single framework the main features of object-oriented schemata and XML DTDs. Special attention has been paid to key constraints for XML in [31], where the notion of relative key (i.e., key relative to a fragment) is introduced and the implication problem is shown to be decidable in polynomial time; further complexity and axiomatization results are given in [81] and [32]. Satisfiability of DTD specifications has been studied in [80] and shown to be NP-complete. The XUpdate language that was used for the experimental evaluation is described in [113]. A discussion on update languages for XML is found in [161]; an XQuery-based implemented prototype of an XML update language is described in [160], whereas a rule-based framework is presented in [129]. The output of the simplification process described in this chapter is a query expressed in XQuery, a query language based on XPath. The complexity of XPath query evaluation has been studied in [91, 92].

Chapter 6

Experimental Evaluation

In order to demonstrate the effectiveness of the simplification procedure, we have tested it on more complex examples and show here our experimental results, both for the non-recursive and the recursive case, in sections 6.1 and 6.2, respectively. All the tests were run on a machine with a 2.4 GHz processor, 1 GB of RAM and 80 GB of hard disk. The random data sets used for the tests were generated beforehand, so that the different procedures under analysis could run on exactly the same data and thus be compared fairly. All tests were repeated 20 times, so as to have an average measure of the execution time. The symbolic simplifications shown in this chapter were obtained with an experimental implementation of the simplification procedure which is available on the World Wide Web [126]. The time needed to obtain the simplifications shown in this chapter is negligible with respect to the execution times of the tests and is therefore not reported. Finally, in section 6.3 we show a performance analysis for XML databases.

6.1 Experiments for non-recursive databases

We first consider the tests presented in [153], which were initially proposed in [67]; the method of the so-called inconsistency indicators of [153] was shown to run more efficiently than previous methods, namely [147, 122, 69] and naive constraint checking (i.e., with no simplification). We show that, on their examples, we obtain much better performance (all obtained simplifications are indeed ideal). For compatibility with the compared method, the tests are run under a Prolog system¹.

¹All experiments, including the replication of results from [153] and [155], were run under SICStus Prolog 3.11.

Let S1 be the schema with the following intensional database

mother(X, Y) ← husband(Z, X) ∧ father(Z, Y)
parent(X, Y) ← father(X, Y)
parent(X, Y) ← mother(X, Y)
wife(X, Y) ← husband(Y, X)
married(X, Y) ← husband(X, Y)
married(X, Y) ← wife(X, Y)
employed(X) ← occup(X, serv)
student(X) ← occup(X, stud)
dependent(X, Y) ← parent(Y, X) ∧ employed(Y) ∧ student(X)
dependent(X, Y) ← married(Y, X) ∧ employed(Y) ∧ ¬employed(X)
self(X) ← married(Y, X) ∧ ¬employed(Y)
guardian(X, Y) ← dependent(Y, X)

and the following integrity constraints:

← guardian(X, Y) ∧ ¬sponsor(X, Y)
← married(X1, Y1) ∧ student(X1)
← occup(X2, Y2) ∧ occup(X2, Z) ∧ Z ≠ Y2

The distribution of facts in the initial database considered in [153] is as follows: 177 father facts, 229 husband facts, 620 occup facts and 59 sponsor facts. We considered additions of tuples to the father and husband relations. To test whether an update U1 = {father(a, b)} leads to inconsistency, the method of the inconsistency indicators proposes the following tests (rewritten with our notation):

{ ← ¬sponsor(a, b) ∧ guardian(a, b), ← guardian(X, b) ∧ ¬sponsor(X, b) }.

These can be checked by asserting the update as a Prolog fact father(a,b) and calling the Prolog query inconsistent(father(a,b)) on the following Prolog program:

% The two tests above, checked after asserting the updated fact father(a,b):
inconsistent(father(X,Y)) :- \+ sponsor(X,Y), guardian(X,Y).
inconsistent(father(Z,Y)) :- guardian(X,Y), \+ sponsor(X,Y).

Their checking strategy is therefore: assert the update, then retract if inconsistency was detected. The simplification given by Simp is more specialized and refers only to the extensional predicates:

Simp^U1(S1) = { ← occup(a, serv) ∧ occup(b, stud) ∧ ¬sponsor(a, b),
               ← husband(a, X) ∧ occup(X, serv) ∧ occup(b, stud) ∧ ¬sponsor(X, b) }.
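In Prolog, these simplified denials amount to a pre-test of the following shape (a minimal sketch; the predicate names violates_father/2 and add_father/2 are ours):

:- dynamic father/2.

% Succeeds iff adding father(A,B) would violate one of the two simplified denials.
violates_father(A, B) :-
    (   occup(A, serv), occup(B, stud), \+ sponsor(A, B)
    ;   husband(A, X), occup(X, serv), occup(B, stud), \+ sponsor(X, B)
    ).

% Test first, then assert the new fact only if no violation is detected.
add_father(A, B) :-
    \+ violates_father(A, B),
    assert(father(A, B)).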

Our strategy is: first test, then assert the update if inconsistency was not detected. To see whether the approaches “scale”, we ran our tests on databases that are bigger than the initial one by a given factor. Figures 6.1 and 6.2 report this factor on the X-axis and the measured average execution times (in milliseconds) for the additions of 177 father facts and 229 husband facts, respectively, with both the inconsistency indicators (II) approach and our simplification (Simp) approach. In both cases, the performance worsens very quickly with the inconsistency indicators method, whereas it basically remains constant with our approach, with times under 10 ms. The last example of [153] considers a schema S2 consisting of the following integrity constraints and intensional predicates².

← civilst(X, Y1, Z1, t1) ∧ civilst(X, Y2, Z2, t2) ∧ ¬(Y1 = Y2 ∧ Z1 ≠ Z2 ∧ t1 ≠ t2)
← father(X1, Y) ∧ father(X2, Y) ∧ X1 ≠ X2
← husband(X1, Y) ∧ husband(X2, Y) ∧ X1 ≠ X2
← husband(X, Y1) ∧ husband(X, Y2) ∧ Y1 ≠ Y2
← civilst(X, Y, Z, Tax) ∧ (¬(X > 0 ∧ X < 100000 ∧ Y > 0 ∧ Y < 125) ∨ (Z ≠ m ∧ Z ≠ f) ∨ (Tax ≠ stud ∧ Tax ≠ ret ∧ Tax ≠ biz ∧ Tax ≠ serv))
← civilst(X, Y, Z, stud) ∧ ¬(Y < 25)
← civilst(X, Y, Z, ret) ∧ ¬(Y > 60)
← father(X, Y) ∧ civilst(X, P, S, Q) ∧ S ≠ m
← father(X, Y) ∧ civilst(Y, P, S, Q) ∧ S ≠ m
← husband(X, Y) ∧ civilst(X, P, S, Q) ∧ S ≠ m
← husband(X, Y) ∧ civilst(Y, P, S, Q) ∧ S ≠ f
← husband(X, Y) ∧ age(X, P) ∧ age(Y, Q) ∧ (P < 20 ∨ Q < 20)
← civilst(X, Y, Z, Tax) ∧ Y < 20 ∧ Tax ≠ stud
← dependent(X, Y) ∧ ¬tax(Y, X)

parent(X, Y) ← father(X, Y)
parent(X, Y) ← mother(X, Y)
mother(X, Y) ← father(Z, Y) ∧ husband(Z, X)
age(X, Y) ← civilst(X, Y, P, Q)
dependent(X, Y) ← parent(Y, X) ∧ occup(Y, serv) ∧ occup(X, stud)
occup(X, Y) ← civilst(X, P, Q, Y)

The update in question is a transaction of the form:

U2 = { civilst(a, pa, m, oa), civilst(b, pb, f, ob), civilst(c, pc, sc, stud), husband(a, b), father(a, c), tax(a, c) }

We observe that in the example it is explicitly assumed that the added family facts were not already in the database; let us indicate this extra hypothesis as ∆. The simplification given by the II method consists of several sets of simplified constraints: one set for every single update in U2. Instead, the simplification with respect to the whole transaction U2 given by Optimize_∆(Simp^U2(S2)) returns ⟨∅, ∅⟩. The results of [153] have execution times that vary roughly linearly with respect to the size of the database. Our simplified theory (∅) is clearly a great improvement over these results, since it executes in virtually no time and guarantees, without further checking, that this transaction pattern cannot affect integrity. This example was also used in [116], where the authors, unfortunately, only compare their method to [122], but not to [153]. However, our transactional simplification is clearly unbeatable.

²For compactness, the notation uses disjunctions, but is otherwise compatible with LS.

Figure 6.1: Simp vs. Inconsistency Indicators: father

The redundancies of [153] were reconsidered by the same author in [155]. For the extended example discussed in [155], the schema is as follows.

S3 = ⟨ { mother(X, Y) ← husband(Z, X) ∧ father(Z, Y),
        parent(X, Y) ← father(X, Y),
        parent(X, Y) ← mother(X, Y),
        agediff(X, Y, n) ← age(X, n1) ∧ age(Y, n2) ∧ minus(n, n1, n2) },
      { ← parent(X, Y) ∧ agediff(X, Y, n) ∧ n < 15 } ⟩

Figure 6.2: Simp vs. Inconsistency Indicators: husband

The results for their revised inconsistency indicators method (RII) and Simp on the addition of father facts on a distribution similar to that considered for S1 are reported in figure 6.3. In this case our simplifications basically correspond to the unfolding of their so-called revised inconsistency indicators, so there is almost no observable difference in the execution times of the two methods. We stress, however, that the method of [155] has a much more restricted expressive power, in that the updates are limited to singleton insertions and no negations are allowed in the database. Furthermore, in this case the update was simple, so the computational effort required for assertion and retraction of facts was little; however, our approach based on early recognition of inconsistency proves yet more efficient for cases in which updates lead to illegal states (dramatically, if the transactions are complex). To see this effect we updated a small database (2 father facts and 2 age facts) with schema S3 with an illegal father insertion and measured, with the RII method, an answer time approximately four times bigger than with the method based on Simp. This behavior is amplified as the database grows (and it is thus more expensive to add facts for the DBMS): attempting 10000 times the insertion of an illegal father fact on a database with approximately 5000 father facts took about 1 s with the RII method, but only 70 ms with Simp. This reflects the fact that with our strategy, upon an illegal update, we just perform a test, whereas the RII method requires executing the update, performing a consistency test and then rolling back the update.

6.2 Experiments for recursive databases

In order to demonstrate the effectiveness of the simplification procedure, we have tested it on random update sets for example 4.3.8 on page 82. Our tests were run using DES 1.1 [148], which is a datalog system featuring full recursive evaluation and stratified negation. DES is implemented on top of Prolog; we could therefore program our tests in Prolog and simulate insertions by means of assert and deletions by retract. The DES query engine is optimized with memoization techniques for answering queries based on previous answers. In this case, we always pose the same query ← p(X, X) to check whether the graph is acyclic, and therefore answers can be reused for subsequent queries. Our method greatly improves performance even in the presence of an already optimized system.

Figure 6.3: Simp vs. Revised Inconsistency Indicators: age

Average execution times are indicated in milliseconds (within a time frame of 50 seconds) and the number of attempted insertions of edge facts is indicated on the X-axis. Each figure reports the execution times needed to update the database and check its consistency according to:

• The un-optimized integrity constraint (diamonds).

• The II produced by Seljée’s method [154] (crosses).

• The formula ← p(b, a) (II∗), produced by manually removing redundancies from the II (squares).

• The simplification obtained with Simp (triangles). Note that in this case consistency is checked before the update.
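A minimal datalog sketch of this setting, assuming the usual transitive-closure definition of the view p over the relation edge used in example 4.3.8, is:

p(X,Y) :- edge(X,Y).
p(X,Y) :- edge(X,Z), p(Z,Y).
% Integrity requires the graph to be acyclic: the query p(X,X) must have no answers.
% Pre-test for the insertion of edge(a,b): the update is legal iff p(b,a) has no answers,
% since an existing path from b back to a would be closed into a cycle by the new edge.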

The third curve (a “perfect” post-test, which was produced by hand) was included for comparison with the test-before-update strategy. In particular, in figure 6.4 we randomly generated 1500 arcs between 1000 different nodes, whereas in figure 6.5 we only used 50 different nodes. In the former case the formation of cycles is less likely and the times are generally better. In the latter, however, updates are much more likely to be rejected (44% of the updates were rejected in total, compared to only 12% in the former case); Simp in this case performs significantly better, with improvements around 20% even with respect to the manually produced formula. The interpretation of these results is in accordance with the following observations:

• The comparison between the performance of the optimized and non-optimized checks shows that the optimized version is always more efficient than the original one.

• In both the un-optimized and II methods the updated state (i.e., many more paths) needs to be computed before the consistency test, which is a possibly expensive operation in the presence of recursive views.

• The gain of early detection of inconsistency, which is a distinctive feature of our approach, is unquestionable in the case of illegal updates. In such a case, with our optimized strategy, the simplified constraint immediately reports an integrity violation with respect to the proposed update, which is therefore not executed. On the other hand, the other methods require executing the update, performing a consistency test and then rolling back the update.

Note that the extra burden due to the execution and subsequent rollback of an illegal update is even more evident for compound updates, such as those of example 4.3.9 on page 83; in these cases the benefits of a pre-test with respect to a post-test are even greater. We observe that the above comparisons did not take into account the time spent to produce the optimized constraints, as these can be generated at schema design time and thus do not interfere with run time performance.

Figure 6.4: Tests on recursive databases: sparse data

Figure 6.5: Tests on recursive databases: dense data

6.3 Experiments for XML databases

We now present some experiments conducted on a series of XML datasets matching the DTD of example 5.3.1 on page 102, varying in dimension from 32 to 256 MB. Figure 6.6 refers to the integrity constraints of example 5.3.1. The data were generated by re-mapping data from the DBLP repository [118] into the schema of our running example. Our tests were run using eXist [132] as XQuery engine.

Execution times are indicated in milliseconds and represent the average of the measured times of 20 attempts for each experiment (plus 5 additional operations that were used as a “warm-up” procedure and thus not measured). The size of the documents is indicated in MB on the x-axis. The figure reports three curves representing respectively the time needed to:

• Verify the original constraint (diamonds).

• Verify the optimized constraint (squares).

• Execute an update, verify the original constraint and undo the update (triangles).

Again, we do not have to take into account the time spent to produce the optimized constraints, as these are generated at schema design time and thus do not interfere with run time performance. The curves with diamonds and squares are used to compare integrity checking in the non-simplified and, respectively, simplified case, when the update is legal. The execution time needed to perform the update is not included, as this is identical (and unavoidable) in both the optimized and un-optimized case. The curve with triangles includes both the update execution time and the time needed to roll back the update, which is necessary when the update is illegal; when the update is illegal, we then compare the curve with triangles to the curve with squares. Rollbacks were simulated by performing a compensating action to re-construct the state prior to the update. The interpretation of these results is twofold, as we must consider two possible scenarios:

• The update is legal: in the un-optimized framework the update is executed first and the full constraint is then checked against the updated database (showing that the update is legal); on the other hand, with our optimized strategy, the simplified constraint is checked first and the update is performed afterwards.

• The update is illegal: in the un-optimized framework execution is as in the previous case, but this time the check shows that there is some inconsistency and, finally, a compensating action is performed. On the contrary, with our optimized strategy, the simplified constraint is checked first, which reports an integrity violation with respect to the proposed update; therefore the update statement is not executed.

From the experimental results shown in figure 6.6 it is possible to observe two important features.

• Again, the optimized version is always more efficient than the original one. In some cases, as shown in figure 6.6, the difference is remarkable, since the simplified version contains specific values coming from the concrete update statement which allow one to filter the values on which complex computations are applied. Further improvement is due to the elimination of a join condition in the optimized query.

• Early detection of inconsistency pays off in the case of illegal updates. This is prominently apparent in the case considered in figure 6.6, since, as is well-known, the modification of XML documents is an expensive task.

Figure 6.6: Conflict of interests

Chapter 7

Conclusion

We applied program transformation operators to the generation of simplified integrity constraints. A procedure was constructed that makes use of these transformations and produces the simplification searched for according to a minimality criterion. An important contribution of this thesis is the definition of the notion of ideality of a simplification procedure, its connection with the query containment problem, and the analysis of different minimality criteria that can be used to characterize an ideal procedure. In particular, we showed that, for any sensible ordering in which an empty constraint theory represents a minimal element, ideality of simplification corresponds to decidability of query containment. For specific database languages, we described the implementation of these operators in terms of rewrite rules based on resolution, subsumption and replacement of specific patterns. The versatility of the transformation operators, together with the ability of producing a necessary and sufficient condition for checking integrity before a database update, constitutes the main advantage of our method with respect to earlier approaches. Importantly, we have shown practically relevant cases for which the simplification process is guaranteed to return a minimal result and we have given evidence of the feasibility of the approach with an extensive set of experiments.

We have also indicated how this method can be extended so as to handle existential quantification, recursion, arithmetic and aggregates and how it integrates with locking strategies for databases with concurrent transactions. Cases including all these extensions at the same time could be trivially handled by a rule set comprising all rewrite rules defined in chapter 4, in that such rules are not mutually exclusive. However, it should be interesting to study to what extent this is possible and whether further improvements can be obtained by exploiting the interaction between these rules. Investigation of sublanguages for which some form of minimality can be guaranteed in these cases should also be part of future research.

We believe that present database practice can benefit from the described method. An immediate application is, e.g., the implementation of a database driver that, upon requests from an application, communicates with the database and transparently carries out the required simplified integrity checking operations. This would result in major assets in terms of efficiency, all with a compiled approach, since simplifications are generated at design time.

Practical work based on the foundations provided in this thesis is still in its initial phase, although a Prolog implementation of the simplification procedure is already available [126]. Our results form a nucleus that may inspire the extension of commercial systems based on these principles.

Although we used a logical notation throughout the thesis, standard ways of translating integrity constraints into SQL exist; further investigation is needed, however, in order to handle additional language concepts of SQL such as null values. In [70], Decker showed how to implement integrity constraint checking by translating first-order logic specifications into SQL triggers. The result of our transformations can be combined with similar translation techniques and thus integrated in an active database system, according to the idea of embedding integrity control in triggers, as was indicated (albeit without semantic optimization) by Ceri and Widom [46, 44, 43]. In this way the advantages of declarativity are combined with the efficiency of execution. Along these lines, it should be interesting to study the feasibility of a trigger-based integrity maintenance approach for XML that would combine active behavior with constraint simplification.

Further lines of investigation include the integration of visual query specification tools, such as [28], to allow the intuitive specification of XML constraints; in this way domain experts lacking specific competencies in logic would be provided with the ability to design constraints that can be further processed with our approach. An initial analysis of the topic was given in [29]. In this respect, a challenging research topic would be a reformulation of the notion of simplification described in this thesis by means of transformation rules into a set of graph transformation rules expressed in terms of graph grammars for the manipulation of visual constraints.

Other possible enhancements of the proposed framework may be developed using statistical measures on the data, such as estimated relation sizes and cost measurements for performing join and union operations, which are often known by database administrators. Work in this area is closely related to methods for dynamic query processing, e.g., [63, 156]. Finally, we mention that the problem of integrity checking has been or could be addressed also for other paradigms, which were not dealt with in this thesis, such as object-oriented database languages [23, 22] and data streams [13, 12].

Bibliography

[1] S. Abdennadher and H. Christiansen. An experimental CLP platform for integrity constraints and abduction. In H. L. Larsen, J. Kacprzyk, S. Zadrozny, T. An- dreasen, and H. Christiansen, editors, Flexible Query-Answering Systems (FQAS 2000), pages 141–152, 2000. [2] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999. [3] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. [4] M. A. Abr˜ao, B. Bouchou, M. H. F. Alves, D. Laurent, and M. A. Musicante. Incremental Constraint Checking for XML Documents. In Z. Bellahsene, T. Milo, M. Rys, D. Suciu, and R. Unland, editors, Database and XML Technologies, Second International XML Database Symposium, XSym 2004, Toronto, Canada, August 29-30, 2004, Proceedings, volume 3186 of Lecture Notes in Computer Science, pages 112–127. Springer, 2004. [5] F. Afrati and S. S. Cosmadakis. Expressiveness of restricted recursive queries. In D. S. Johnson, editor, ACM Symposium on Theory of Computing, pages 113–126. ACM Press, 1989. [6] F. Afrati, M. Gergatsoulis, and F. Toni. Linearisability on Datalog programs. The- oretical Computer Science, 308(1-3):199–226, 2003. [7] F. N. Afrati, C. Li, and P. Mitra. On containment of conjunctive queries with arithmetic comparisons. In E. Bertino, S. Christodoulakis, D. Plexousakis, V. Christophides, M. Koubarakis, K. B¨ohm, and E. Ferrari, editors, Advances in Database Technology - EDBT 2004, 9th International Conference on Extending Database Technology, Heraklion, Crete, Greece, March 14-18, 2004, Proceedings, volume 2992 of Lecture Notes in Computer Science, pages 459–476. Springer, 2004. [8] K. R. Apt, H. A. Blair, and A. Walker. Towards a theory of declarative knowledge. In J. Minker, editor, Foundations of Deductive Databases and Logic Programming., pages 89–148. Morgan Kaufmann, Los Altos, CA, 1988. [9] K. R. Apt and R. N. Bol. Logic programming and negation: A survey. Journal of Logic Programming, 19/20:9–71, 1994.

123 [10] M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsis- tent databases. In Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 31 - June 2, 1999, Philadel- phia, Pennsylvania, pages 68–79. ACM Press, 1999. [11] O. Arieli, M. Denecker, B. V. Nuffelen, and M. Bruynooghe. Database repair by signed formulae. In D. Seipel and J. M. T. Torres, editors, Foundations of In- formation and Knowledge Systems, Third International Symposium (FoIKS 2004), volume 2942 of Lecture Notes in Computer Science, pages 14–30. Springer, 2004. [12] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In L. Popa, editor, Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA, pages 1–16. ACM, 2002. [13] S. Babu, U. Srivastava, and J. Widom. Exploiting k-constraints to reduce memory overhead in continuous queries over data streams. ACM Transactions on Database Systems (TODS), 29(3):545–580, 2004. [14] F. Bancilhon. Naive evaluation of recursively defined relations. In M. L. Brodie and J. Mylopoulos, editors, On Knowledge Base Management Systems (Book resulting from the Islamorada Workshop 1985), Topics in Information Systems, pages 165– 178. Springer, 1986. [15] F. Bancilhon and R. Ramakrishnan. An amateur’s introduction to recursive query processing strategies. In C. Zaniolo, editor, Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 28-30, 1986, pages 16–52. ACM Press, 1986. [16] C. Baral, S. Kraus, and J. Minker. Combining multiple knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 3(2):208–220, 1991. [17] C. Baral, S. Kraus, J. Minker, and V. S. Subrahmanian. Combining knowledge bases consisting of first-order analysis. Computational Intelligence, 8:45–71, 1992. [18] D. Barbosa, A. O. Mendelzon, L. Libkin, L. Mignet, and M. Arenas. Efficient Incremental Validation of XML Documents. In Proceedings of the 20th International Conference on Data Engineering, ICDE 2004, 30 March - 2 April 2004, Boston, MA, USA, pages 671–682. IEEE Computer Society, 2004. [19] A. Behrend. Soft stratification for magic set based query evaluation in deductive databases. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 102–110. ACM Press, 2003. [20] A. Behrend. Soft Stratification for Transformation-Based Approaches to Deductive Databases. PhD thesis, University of Bonn, 2004. [21] M. Benedikt, G. Bruns, J. Gibson, R. Kuss, and A. Ng. Automated Update Man- agement for XML Integrity Constraints. In Informal Proceedings of PLAN-X Work- shop, 2002.

[22] V. Benzaken and A. Doucet. Thémis: A database programming language handling integrity constraints. VLDB Journal, 4(3):493–517, 1995.

[23] V. Benzaken and X. Schaefer. Static management of integrity in object-oriented databases: Design and implementation. In H.-J. Schek, F. Saltor, I. Ramos, and G. Alonso, editors, Advances in Database Technology - EDBT’98, 6th International Conference on Extending Database Technology, Valencia, Spain, March 23-27, 1998, Proceedings, volume 1377 of Lecture Notes in Computer Science, pages 311–325. Springer, 1998.

[24] P. A. Bernstein and B. T. Blaustein. Fast methods for testing quantified relational calculus assertions. In M. Schkolnick, editor, Proceedings of the 1982 ACM SIGMOD International Conference on Management of Data, Orlando, Florida, June 2-4, 1982, pages 39–50. ACM Press, 1982.

[25] P. Bohannon, J. Freire, P. Roy, and J. Sim´eon. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of the 18th International Conference on Data Engineering, 26 February - 1 March 2002, San Jose, CA, pages 64–75. IEEE Computer Society, 2002.

[26] R. N. Bol. Loop checking in partial deduction. Journal of Logic Programming, 16(1):25–46, 1993.

[27] P. A. Bonatti. On the decidability of containment of recursive Datalog queries - preliminary report. In A. Deutsch, editor, Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 14-16, 2004, Paris, France, pages 297–306. ACM, 2004.

[28] D. Braga, A. Campi, and S. Ceri. XQBE (XQuery By Example): a visual interface to the standard XML query language. ACM Transactions on Database Systems (TODS), 30(2):398–443, June 2005.

[29] D. Braga, A. Campi, D. Martinenghi, A. Raffio, and D. Salvi. XQBE: the Swiss Army Knife for Semi-structured Data. In Proceedings of the Thirteenth Italian Symposium on Advanced Database Systems, SEBD 2005, Brizen-Bressanone, Italy, June 19-22, 2005, pages 284–291, 2005.

[30] F. Bry and R. Manthey. Checking consistency of database constraints: a logical ba- sis. In W. W. Chu, G. Gardarin, S. Ohsuga, and Y. Kambayashi, editors, VLDB’86 Twelfth International Conference on Very Large Data Bases, August 25-28, 1986, Kyoto, Japan, Proceedings, pages 13–20. Morgan Kaufmann, 1986.

[31] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan. Reasoning about keys for XML. Information Systems, 28(8):1037–1063, 2003.

[32] P. Buneman, W. Fan, and S. Weinstein. Interaction between path and type con- straints. ACM Transactions on Computational Logic (TOCL), 4(4):530–577, 2003.

[33] A. Calì, D. Calvanese, G. De Giacomo, and M. Lenzerini. Data integration under integrity constraints. In A. B. Pidduck, J. Mylopoulos, C. C. Woo, and M. T. Özsu, editors, Advanced Information Systems Engineering, 14th International Conference, CAiSE 2002, Toronto, Canada, May 27-31, 2002, Proceedings, volume 2348 of Lecture Notes in Computer Science, pages 262–279. Springer, 2002.

[34] A. Cal`ı,D. Calvanese, G. De Giacomo, and M. Lenzerini. On the expressive power of data integration systems. In S. Spaccapietra, S. T. March, and Y. Kambayashi, editors, Conceptual Modeling - ER 2002, 21st International Conference on Concep- tual Modeling, Tampere, Finland, October 7-11, 2002, Proceedings, volume 2503 of Lecture Notes in Computer Science, pages 338–350. Springer, 2002.

[35] A. Cal`ı,D. Calvanese, G. De Giacomo, and M. Lenzerini. On the role of integrity constraints in data integration. IEEE Data Engineering Bulletin, 25(3):39–45, 2002.

[36] A. Cal`ı,D. Lembo, and R. Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In PODS ’03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 260–271, New York, NY, USA, 2003. ACM Press.

[37] D. Calvanese, G. De Giacomo, and M. Lenzerini. On the decidability of query containment under constraints. In Proceedings of the seventeenth ACM SIGACT- SIGMOD-SIGART symposium on Principles of database systems, pages 149–158. ACM Press, 1998.

[38] D. Calvanese, G. De Giacomo, and M. Vardi. Decidable containment of recursive queries. In D. Calvanese, M. Lenzerini, and R. Motwani, editors, Database The- ory - ICDT 2003, 9th International Conference, Siena, Italy, January 8-10, 2003, Proceedings, volume 2572 of Lecture Notes in Computer Science, pages 330–345. Springer, 2003.

[39] M. Carlsson, G. Ottosson, and B. Carlson. An open-ended finite domain constraint solver. In H. Glaser, P. H. Hartel, and H. Kuchen, editors, Programming Languages: Implementations, Logics, and Programs, 9th International Symposium, PLILP’97, Including a Special Trach on Declarative Programming Languages in Education, Southampton, UK, September 3-5, 1997, Proceedings, volume 1292 of Lecture Notes in Computer Science, pages 191–206. Springer, 1997.

[40] J. Carmo, R. Demolombe, and A. J. I. Jones. An application of deontic logic to information system constraints. Fundamenta Informaticae, 48(2-3):165–181, 2001.

[41] S. Carrico, B. Ewbank, T. G. Griffin, J. Meale, and H. Trickey. A tool for developing safe and efficient database transactions. In XV International Switching Symposium of the World Telecomminications Congress. April, 1995, pages 173–177, 1995.

[42] T. Catarci and I. F. Cruz. On expressing stratified datalog. In 2nd ICLP Workshop on Deductive Databases and Logic Programming, June 1994, Italy, pages 85–100, 1994.

126 [43] S. Ceri, R. Cochrane, and J. Widom. Practical applications of triggers and con- straints: Success and lingering issues (10-year award). In A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 254–262. Morgan Kauf- mann, 2000. [44] S. Ceri, P. Fraternali, S. Paraboschi, and L. Tanca. Automatic generation of pro- duction rules for integrity maintenance. ACM Transactions on Database Systems (TODS), 19(3):367–422, 1994. [45] S. Ceri, G. Gottlob, and L. Tanca. Logic programming and databases. Springer- Verlag New York, Inc., 1990. [46] S. Ceri and J. Widom. Deriving production rules for constraint maintainance. In D. McLeod, R. Sacks-Davis, and H.-J. Schek, editors, 16th International Conference on Very Large Data Bases, August 13-16, 1990, Brisbane, Queensland, Australia, Proceedings, pages 566–577. Morgan Kaufmann, 1990. [47] U. S. Chakravarthy, J. Grant, and J. Minker. Foundations of semantic query op- timization for deductive databases. In J. Minker, editor, Foundations of Deductive Databases and Logic Programming, pages 243–273, Los Altos, CA, 1988. Morgan Kaufmann. [48] U. S. Chakravarthy, J. Grant, and J. Minker. Logic-based approach to semantic query optimization. ACM Transactions on Database Systems (TODS), 15(2):162– 207, 1990. [49] A. K. Chandra and D. Harel. Horn clause queries and generalizations. Journal of Logic Programming, 1(2):158–163, 1985. [50] C. L. Chang and R. C. Lee. Symbolic Logic and Mechanical Theorem Proving. Academic Press, 1973. [51] S. Chaudhuri and M. Y. Vardi. On the equivalence of recursive and nonrecursive dat- alog programs. In Proceedings of the Eleventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 2-4, 1992, San Diego, Cali- fornia, pages 55–66. ACM Press, 1992. [52] S. Chaudhuri and M. Y. Vardi. Optimization of real conjunctive queries. In Proceed- ings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25-28, 1993, Washington, DC, pages 59–70. ACM Press, 1993. [53] J. Chomicki. Efficient checking of temporal integrity constraints using bounded history encoding. ACM Transactions on Database Systems (TODS), 20(2):149– 186, 1995. [54] H. Christiansen. Automated reasoning with a constraint-based metainterpreter. Journal of Logic Programming, 37(1-3):213–254, 1998.

[55] H. Christiansen. Integrity constraints and constraint logic programming. In INAP’99; Proceedings of the 12th International Conference on Application of Prolog, pages 5–12. Science University of Tokyo, Japan, 1999. Invited talk.

[56] H. Christiansen and V. Dahl. Assumptions and abduction in Prolog. In S. Mu˜noz- Hern´andez, J. M. G´omez-Perez, and P. Hofstedt, editors, Proceedings of WLPE 2004: 14th Workshop on Logic Programming Environments and MultiCPL 2004: Third Workshop on Multiparadigm Constraint Programming Languages Workshop Proceedings, pages 87–101, 2004.

[57] H. Christiansen and D. Martinenghi. Symbolic constraints for meta-logic program- ming. Applied Artificial Intelligence, 14(4):345–367, 2000.

[58] H. Christiansen and D. Martinenghi. Simplification of database integrity constraints revisited: A transformational approach. In M. Bruynooghe, editor, Logic Based Program Synthesis and Transformation, 13th International Symposium LOPSTR 2003, Uppsala, Sweden, August 25-27, 2003, Revised Selected Papers, volume 3018 of Lecture Notes in Computer Science, pages 178–197. Springer, 2004.

[59] H. Christiansen and D. Martinenghi. Simplification of integrity constraints for data integration. In D. Seipel and J. M. T. Torres, editors, Foundations of Information and Knowledge Systems, Third International Symposium (FoIKS 2004), volume 2942 of Lecture Notes in Computer Science, pages 31–48. Springer, 2004.

[60] K. L. Clark. Negation as failure. In H. Gallaire and J. Minker, editors, Logic and Data Bases, Symposium on Logic and Data Bases, Centre d’´etudes et de recherches de Toulouse, 1977. Advances in Data Base Theory, pages 293–322. Plenum Press, New York, 1978.

[61] E. F. Codd. Further normalization of the database relational model. In R. Rustin, editor, Courant Computer Science Symposium 6: Data Base Systems, pages 33–64. Prentice-Hall, Englewood Cliffs, N.J., 1972.

[62] S. Cohen, W. Nutt, and Y. Sagiv. Containment of aggregate queries. In D. Cal- vanese, M. Lenzerini, and R. Motwani, editors, Database Theory - ICDT 2003, 9th International Conference, Siena, Italy, January 8-10, 2003, Proceedings, volume 2572 of Lecture Notes in Computer Science, pages 111–125. Springer, 2003.

[63] R. L. Cole and G. Graefe. Optimization of dynamic query evaluation plans. In R. T. Snodgrass and M. Winslett, editors, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 24-27, 1994, pages 150–160. ACM Press, 1994.

[64] S. S. Cosmadakis. On the first-order expressibility of recursive queries. In Proceed- ings of the eighth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 311–323. ACM Press, 1989.

[65] W. Cowley and D. Plexousakis. Temporal integrity constraints with indeterminacy. In A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter,

128 and K.-Y. Whang, editors, VLDB 2000, Proceedings of 26th International Con- ference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 441–450. Morgan Kaufmann, 2000. [66] S. Das. Deductive Databases and Logic Programming. Addison-Wesley, 1992. [67] S. K. Das and M. H. Williams. Integrity checking methods in deductive databases: a comparative evaluation. In BNCOD 7: Proceedings of the seventh British national conference on Databases, pages 85–116. Cambridge University Press, 1989. [68] S. de Amo, W. A. Carnielli, and J. Marcos. A logical framework for integrating inconsistent information in multiple databases. In T. Eiter and K.-D. Schewe, ed- itors, Foundations of Information and Knowledge Systems, Second International Symposium, FoIKS 2002 Salzau Castle, Germany, February 20-23, 2002, Proceed- ings, volume 2284 of Lecture Notes in Computer Science, pages 67–84. Springer, 2002. [69] H. Decker. Integrity enforcement on deductive databases. In L. Kerschberg, editor, Expert Database Conference, Proceedings From the First International Conference, Charleston, South Carolina, USA, April 1-4, 1986, pages 381–395. Benjamin Cum- mings, 1987. [70] H. Decker. Translating advanced integrity checking technology to SQL. In J. H. Doorn and L. C. Rivero, editors, Database integrity: challenges and solutions, pages 203–249. Idea Group Publishing, 2002. [71] H. Decker and M. Celma. A slick procedure for integrity checking in deductive databases. In P. Van Hentenryck, editor, Logic Programming: Proc. of the 11th International Conference on Logic Programming, pages 456–469. MIT Press, Cam- bridge, MA, 1994. [72] H. Decker, J. Villadsen, and T. Waragai, editors. Paraconsistent Computational Logic, This proceedings volume contains the papers presented at the ICLP 2002 workshop Paraconsistent Computational Logic, on July 27, in Copenhagen, Den- mark, as part of the Federated Logic Conference (FLoC), volume 95 of Datalogiske Skrifter. Roskilde University, Roskilde, Denmark, 2002. [73] M. Denecker, N. Pelov, and M. Bruynooghe. Ultimate well-founded and stable semantics for logic programs with aggregates. In P. Codognet, editor, Logic Pro- gramming, 17th International Conference, ICLP 2001, Paphos, Cyprus, November 26 - December 1, 2001, Proceedings, volume 2237 of Lecture Notes in Computer Science, pages 212–226. Springer, 2001. [74] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. In A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors, SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA., pages 431–442. ACM Press, 1999. [75] E. W. Dijkstra. A Discipline of Programming. Prentice-Hall, 1976.

[76] G. Dong and J. Su. Incremental Maintenance of Recursive Views Using Relational Calculus/SQL. SIGMOD Record, 29(1):44–51, 2000.

[77] N. Eisinger and H. J. Ohlbach. Deduction systems based on resolution. In D. M. Gabbay, C. J. Hogger, and J. A. Robinson, editors, Handbook of Logic in Artificial Intelligence and Logic Programming - Vol 1: Logical Foundations, pages 183–271. Clarendon Press, Oxford, 1993.

[78] R. Fagin. Multivalued dependencies and a new normal form for relational databases. ACM Transactions on Database Systems (TODS), 2(3):262–278, 1977.

[79] W. Fan, G. M. Kuper, and J. Siméon. A unified constraint model for XML. Computer Networks, 39(5):489–505, 2002.

[80] W. Fan and L. Libkin. On XML integrity constraints in the presence of DTDs. Journal of the ACM, 49(3):368–406, 2002.

[81] W. Fan and J. Siméon. Integrity constraints for XML. Journal of Computer and System Sciences, 66(1):254–291, 2003.

[82] C. G. Fermüller, A. Leitsch, U. Hustadt, and T. Tammet. Resolution decision procedures. In J. A. Robinson and A. Voronkov, editors, Handbook of Automated Reasoning (in 2 volumes), pages 1791–1849. Elsevier and MIT Press, 2001.

[83] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDBMS. IEEE Data Engineering Bulletin, 22(3):27–34, 1999.

[84] D. Florescu, A. Levy, and D. Suciu. Query containment for conjunctive queries with regular expressions. In PODS ’98, Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 1-3, 1998, Seattle, Washington, pages 139–148. ACM Press, New York, 1998.

[85] T. W. Frühwirth. Theory and practice of constraint handling rules. Journal of Logic Programming, 37(1-3):95–138, 1998.

[86] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice-Hall, 2002.

[87] M. Gelfond and V. Lifschitz. Minimal model semantics for logic programming. In R. Kowalski and K. Bowen, editors, Proceedings of the Fifth Logic Programming Symposium, pages 1070–1080. MIT Press, 1988.

[88] P. Godfrey, J. Grant, J. Gryz, and J. Minker. Integrity constraints: Semantics and applications. In J. Chomicki and G. Saake, editors, Logics for Databases and Information Systems (the book grew out of the Dagstuhl Seminar 9529: Role of Logics in Information Systems, 1995), pages 265–306. Kluwer, 1998.

[89] P. Godfrey, J. Gryz, and C. Zuzarte. Exploiting constraint-like data characterizations in query optimization. In W. G. Aref, editor, SIGMOD ’01: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 582–592. ACM Press, 2001.

[90] G. Gottlob. Subsumption and implication. Information Processing Letters, 24(2):109–111, 1987.

[91] G. Gottlob, C. Koch, and R. Pichler. The complexity of XPath query evaluation. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 9-12, 2003, San Diego, CA, USA, pages 179–190. ACM, 2003.

[92] G. Gottlob, C. Koch, and K. Schulz. Conjunctive Queries over Trees. In A. Deutsch, editor, Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 14-16, 2004, Paris, France, pages 189–200, 2004.

[93] J. Grant and J. Minker. Integrity constraints in knowledge based systems. In H. Adeli, editor, Knowledge Engineering Vol II, Applications, pages 1–25. McGraw-Hill, 1990.

[94] J. Grant and J. Minker. The impact of logic programming on databases. Communications of the ACM (CACM), 35(3):66–81, 1992.

[95] J. Grant and J. Minker. A logic-based approach to data integration. Theory and Practice of Logic Programming (TPLP), 2(3):323–368, 2002.

[96] S. Greco, C. Sirangelo, I. Trubitsyna, and E. Zumpano. Preferred repairs for inconsistent databases. In 7th International Database Engineering and Applications Symposium (IDEAS 2003), 16-18 July 2003, Hong Kong, China, pages 202–211. IEEE Computer Society, 2003.

[97] P. W. P. J. Grefen. Combining theory and practice in integrity control: A declarative approach to the specification of a transaction modification subsystem. In R. Agrawal, S. Baker, and D. A. Bell, editors, 19th International Conference on Very Large Data Bases, August 24-27, 1993, Dublin, Ireland, Proceedings, pages 581–591. Morgan Kaufmann, 1993.

[98] U. Griefahn. A Uniform Approach to the Implementation of Deductive Databases. PhD thesis, University of Bonn, 1997.

[99] A. Gupta and I. S. Mumick, editors. Materialized views: techniques, implementations, and applications. MIT Press, 1999.

[100] L. Henschen, W. McCune, and S. Naqvi. Compiling constraint-checking programs from first-order formulas. In H. Gallaire, J. Minker, and J.-M. Nicolas, editors, Advances In Database Theory, volume 2, pages 145–169. Plenum Press, New York, 1984.

[101] C. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–580, 1969.

[102] A. Hsu and T. Imielinski. Integrity checking for multiple updates. In S. B. Navathe, editor, Proceedings of the 1985 ACM SIGMOD International Conference on Management of Data, Austin, Texas, May 28-31, 1985, pages 152–168. ACM Press, 1985.

[103] INCITS. Information technology - Database languages - SQL - Part 2: Foundation (SQL/Foundation) - INCITS/ISO/IEC 9075-2-1999, 1999.

[104] Y. E. Ioannidis and E. Wong. Towards an algebraic theory of recursion. Journal of the ACM, 38(2):329–381, 1991.

[105] J. Jaffar and M. J. Maher. Constraint logic programming: A survey. Journal of Logic Programming, 19/20:503–581, 1994.

[106] N. Jones, C. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Prentice Hall International, 1993.

[107] A. C. Kakas, R. A. Kowalski, and F. Toni. Abductive logic programming. Journal of Logic and Computation, 2:719–770, 1992.

[108] D. Kapur and P. Narendran. NP-Completeness of the Set Unification and Matching Problems. In J. H. Siekmann, editor, 8th International Conference on Automated Deduction, Oxford, England, July 27 - August 1, 1986, Proceedings, volume 230 of Lecture Notes in Computer Science, pages 489–495. Springer, 1986.

[109] D. B. Kemp and P. J. Stuckey. Semantics of logic programs with aggregates. In V. A. Saraswat and K. Ueda, editors, Logic Programming, Proceedings of the 1991 International Symposium (ISLP), San Diego, California, USA, October 28 - November 1, 1991, pages 387–401. MIT Press, 1991.

[110] D. Knuth and P. Bendix. Simple word problems in universal algebras. In J. Leech, editor, Computational Problems in Abstract Algebra, pages 263–297. Pergamon Press, 1970.

[111] P. Kolaitis. The expressive power of stratified logic programs. Information and Computation, 90(1):50–66, January 1991.

[112] R. Krishnamurthy, R. Kaushik, and J. F. Naughton. XML-SQL query translation literature: The state of the art and open problems. In Proceedings of the International XML Database Symposium (XSym), pages 1–18, 2003.

[113] A. Laux and L. Matin. XUpdate working draft. http://www.xmldb.org/xupdate, October 2000.

[114] S. Y. Lee and T. W. Ling. Further improvements on integrity constraint checking for stratifiable deductive databases. In T. M. Vijayaraman, A. P. Buchmann, C. Mohan, and N. L. Sarda, editors, VLDB’96, Proceedings of 22nd International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 495–505. Morgan Kaufmann, 1996.

[115] A. Leitsch. Resolution theorem proving: A logical point of view. In M. Baaz, editor, Logic Colloquium ’01: Proceedings of the Annual European Summer Meeting of the Association for Symbolic Logic, held in Vienna, Austria, August 6-11, 2001, pages 3–42. A. K. Peters Ltd, 2001.

[116] M. Leuschel and D. de Schreye. Creating specialised integrity checks through partial evaluation of meta-interpreters. Journal of Logic Programming, 36(2):149–193, 1998.

[117] A. Y. Levy. Combining artificial intelligence and databases for data integration. In M. Wooldridge and M. M. Veloso, editors, Artificial Intelligence Today: Recent Trends and Developments, volume 1600 of Lecture Notes in Computer Science, pages 249–268. Springer, 1999.

[118] M. Ley. Digital Bibliography & Library Project. http://dblp.uni-trier.de/.

[119] C. Li. Describing and utilizing constraints to answer queries in data-integration systems. In S. Kambhampati and C. A. Knoblock, editors, Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), August 9-10, 2003, Acapulco, Mexico, pages 163–168, 2003.

[120] J. Lin and A. O. Mendelzon. Merging databases under constraints. International Journal of Cooperative Information Systems, 7(1):55–76, 1998.

[121] J. Lloyd. Foundations of Logic Programming (2nd Edition). Springer, Berlin, 1987.

[122] J. W. Lloyd, L. Sonenberg, and R. W. Topor. Integrity constraint checking in stratified databases. Journal of Logic Programming, 4(4):331–343, 1987.

[123] J. W. Lloyd and R. W. Topor. Making Prolog more expressive. Journal of Logic Programming, 3:225–240, 1984.

[124] J. W. Lloyd and R. W. Topor. A basis for deductive database systems II. Journal of Logic Programming, 3(1):55–67, 1986.

[125] D. Martinenghi. Simplification of integrity constraints with aggregates and arithmetic built-ins. In H. Christiansen, M.-S. Hacid, T. Andreasen, and H. L. Larsen, editors, Flexible Query Answering Systems, 6th International Conference, FQAS 2004, Lyon, France, June 24-26, 2004, Proceedings, volume 3055 of Lecture Notes in Computer Science, pages 348–361. Springer, 2004.

[126] D. Martinenghi. A simplification procedure for integrity constraints. http://www.dat.ruc.dk/~dm/spic/index.html, 2004.

[127] D. Martinenghi and H. Christiansen. Efficient integrity checking for databases with recursive views. In J. Eder, H.-M. Haav, A. Kalja, and J. Penjam, editors, Ninth East-European Conference on Advances in Databases and Information Systems (ADBIS 05), September 12-15, 2005, Tallinn, Estonia, volume 3631 of Lecture Notes in Computer Science, pages 109–124. Springer, 2005.

[128] D. Martinenghi and H. Christiansen. Transaction management with integrity checking. In K. V. Andersen, J. Debenham, and R. R. Wagner, editors, Database and Expert Systems Applications, 16th International Conference, DEXA 2005, Copenhagen, Denmark, August 22-26, 2005, Proceedings, volume 3588 of Lecture Notes in Computer Science, pages 606–615. Springer, 2005.

[129] W. May. XPath-Logic and XPathLog: a logic-programming-style XML data manipulation language. Theory and Practice of Logic Programming (TPLP), 4(3):239–287, 2004.

[130] E. Mayol and E. Teniente. A survey of current methods for integrity constraint maintenance and view updating. In P. P. Chen, D. W. Embley, J. Kouloumdjian, S. W. Liddle, and J. F. Roddick, editors, Advances in Conceptual Modeling: ER ’99 Workshops, Paris, France, November 15-18, 1999, Proceedings, volume 1727 of Lecture Notes in Computer Science, pages 62–73. Springer, 1999.

[131] J. D. McCharen, R. A. Overbeek, and L. Wos. Problems and experiments for and with automated theorem-proving programs. IEEE Transactions on Computers, 25(8):773–782, 1976.

[132] W. Meier. eXist: An Open Source Native XML Database. In Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems, pages 169–183, London, UK, 2003. Springer-Verlag.

[133] A. Motro. Integrity = validity + completeness. ACM Transactions on Database Systems (TODS), 14(4):480–502, 1989.

[134] J. F. Naughton. Minimizing function-free recursive inference rules. Journal of the ACM, 36(1):69–91, 1989.

[135] J. F. Naughton, R. Ramakrishnan, Y. Sagiv, and J. D. Ullman. Efficient evaluation of right-, left-, and multi-linear rules. In J. Clifford, B. G. Lindsay, and D. Maier, editors, Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, Portland, Oregon, May 31 - June 2, 1989, pages 235–242. ACM Press, 1989.

[136] F. Neven and T. Schwentick. XPath Containment in the Presence of Disjunction, DTDs, and Variables. In D. Calvanese, M. Lenzerini, and R. Motwani, editors, Database Theory - ICDT 2003, 9th International Conference, Siena, Italy, January 8-10, 2003, Proceedings, volume 2572 of Lecture Notes in Computer Science, pages 315–329. Springer, 2003.

[137] J.-M. Nicolas. Logic for improving integrity checking in relational data bases. Acta Informatica, 18:227–253, 1982.

[138] S.-H. Nienhuys-Cheng and R. de Wolf. The equivalence of the subsumption theorem and the refutation-completeness for unconstrained resolution. In K. Kanchanasut and J.-J. Lévy, editors, Algorithms, Concurrency and Knowledge: 1995 Asian Computing Science Conference, ACSC ’95, Pathumthani, Thailand, December 11-13, 1995, Proceedings, volume 1023 of Lecture Notes in Computer Science, pages 269–285. Springer, 1995.

[139] U. Nilsson and J. Małuszyński. Logic, Programming and Prolog (2nd ed.). John Wiley & Sons Ltd, 1995.

[140] Ocelot Computer Services Inc. The Ocelot SQL DBMS. http://www.ocelot.ca/.

[141] Y. Papakonstantinou and V. Vianu. Incremental Validation of XML Documents. In D. Calvanese, M. Lenzerini, and R. Motwani, editors, Database Theory - ICDT 2003, 9th International Conference, Siena, Italy, January 8-10, 2003, Proceedings, volume 2572 of Lecture Notes in Computer Science, pages 47–63. Springer, 2003.

[142] T. C. Przymusinski. On the declarative semantics of deductive databases and logic programming. In J. Minker, editor, Foundations of Deductive Databases and Logic Programming, pages 193–216, Los Altos, CA, 1988. Morgan Kaufmann.

[143] X. Qian. An effective method for integrity constraint simplification. In Proceedings of the Fourth International Conference on Data Engineering, February 1-5, 1988, Los Angeles, California, USA, pages 338–345. IEEE Computer Society, 1988.

[144] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334–350, 2001.

[145] R. Reiter. On closed world databases. In H. Gallaire and J. Minker, editors, Logic and Databases, pages 56–76. Plenum Press, New York, 1978.

[146] J. A. Robinson. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12(1):23–41, 1965.

[147] F. Sadri and R. Kowalski. A theorem-proving approach to database integrity. In J. Minker, editor, Foundations of Deductive Databases and Logic Programming, pages 313–362. Morgan Kaufmann, Los Altos, CA, 1988.

[148] F. Sáenz-Pérez. Datalog Educational System V1.1. User’s Manual. Technical Report 139-04, Faculty of Computer Science, UCM, 2004. Available from http://www.fdi.ucm.es/profesor/fernan/DES/.

[149] K. Salem, H. Garcia-Molina, and J. Shands. Altruistic locking. ACM Transactions on Database Systems (TODS), 19(1):117–165, 1994.

[150] A. Sawires, J. Tatemura, O. Po, D. Agrawal, and K. S. Candan. Incremental maintenance of path expression views. In SIGMOD Conference, 2005: Baltimore, Maryland, USA, 2005.

[151] T. Schwentick. XPath query containment. SIGMOD Record, 33(1):101–109, 2004.

[152] M. Sebag and C. Rouveirol. Any-time relational reasoning: Resource-bounded induction and deduction through stochastic matching. Machine Learning, 38(1-2):41–62, 2000.

[153] R. Seljée. A new method for integrity constraint checking in deductive databases. Data & Knowledge Engineering, 15(1):63–102, 1995.

[154] R. Seljée. A Fact Integrity Constraint Checking System for the Validation of Semantic Integrity Constraints after Updating Consistent Deductive Databases. PhD thesis, Tilburg University, 1997.

[155] R. Seljée and H. C. M. de Swart. Three types of redundancy in integrity checking: An optimal solution. Data & Knowledge Engineering, 30(2):135–151, 1999.

[156] P. Seshadri, J. M. Hellerstein, H. Pirahesh, T. Y. C. Leung, R. Ramakrishnan, D. Srivastava, P. J. Stuckey, and S. Sudarshan. Cost-based optimization for magic: Algebra and implementation. In H. V. Jagadish and I. S. Mumick, editors, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996, pages 435–446. ACM Press, 1996.

[157] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors, VLDB’99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 302–314. Morgan Kaufmann, 1999.

[158] O. Shmueli. Decidability and expressiveness aspects of logic queries. In Proceedings of the sixth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 237–249. ACM Press, 1987.

[159] D. Suciu. Query Decomposition and View Maintenance for Query Languages for Unstructured Data. In T. M. Vijayaraman, A. P. Buchmann, C. Mohan, and N. L. Sarda, editors, VLDB’96, Proceedings of 22nd International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 227–238. Morgan Kaufmann, 1996.

[160] G. Sur, J. Hammer, and J. Siméon. UpdateX - An XQuery-Based Language for Processing Updates in XML. In PLAN-X 04, pages 40–53, 2004.

[161] I. Tatarinov, Z. G. Ives, A. Y. Halevy, and D. S. Weld. Updating XML. In W. G. Aref, editor, ACM SIGMOD Conference 2001: Santa Barbara, CA, USA, 2001.

[162] R. W. Topor. Domain-independent formulas and databases. Theoretical Computer Science, 52:281–306, 1987.

[163] J. D. Ullman. Principles of Database and Knowledge-Base Systems, Volume I & II. Computer Science Press, 1988/89.

[164] J. D. Ullman. Information integration using logical views. Theoretical Computer Science, 239(2):189–210, 2000.

[165] A. van Gelder, K. Ross, and J. S. Schlipf. Unfounded sets and well-founded semantics for general logic programs. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 88), pages 221–230. ACM Press, 1988.

[166] G. von Bültzingsloewen. Translating and Optimizing SQL Queries Having Aggregates. In P. M. Stocker, W. Kent, and P. Hammersley, editors, VLDB’87, Proceedings of 13th International Conference on Very Large Data Bases, September 1-4, 1987, Brighton, England, pages 235–243. Morgan Kaufmann, 1987.

[167] W3C. XML Schema 1.0. http://www.w3.org/XML/Schema, May 2001.

[168] C. Youn, L. J. Henschen, and J. Han. Classification of recursive formulas in deductive databases. In H. Boral and P.-Å. Larson, editors, Proceedings of the 1988 ACM SIGMOD international conference on Management of data, Chicago, Illinois, June 1-3, 1988, pages 320–328. ACM Press, 1988.

[169] C. Zaniolo. Key constraints and monotonic aggregates in deductive databases. In A. C. Kakas and F. Sadri, editors, Computational Logic: Logic Programming and Beyond, Essays in Honour of Robert A. Kowalski, Part II, volume 2408 of Lecture Notes in Artificial Intelligence, pages 109–134. Springer, 2002.
