Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications
Total Page:16
File Type:pdf, Size:1020Kb
Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications Dissertation submitted for the degree of Doctor of Natural Sciences Presented by Marc Yves Maria Kramis at the Faculty of Sciences Department of Computer and Information Science Date of the oral examination: 22.04.2014 First supervisor: Prof. Dr. Marcel Waldvogel Second supervisor: Prof. Dr. Marc Scholl i Abstract Life is subdued to constant evolution. So is our data, be it in research, business or personal information management. From a natural, evolutionary perspective, our data evolves through a sequence of fine-granular modifications resulting in myriads of states, each describing our data at a given point in time. From a technical, anti- evolutionary perspective, mainly driven by technological and financial limitations, we treat the modifications as transient commands and only store the latest state of our data. It is surprising that the current approach is to ignore the natural evolution and to willfully forget about the sequence of modifications and therefore the past state. Sticking to this approach causes all kinds of confusion, complexity, and performance issues. Confusion, because we still somehow want to retrieve past state but are not sure how. Complexity, because we must repeatedly work around our own obsolete approaches. Performance issues, because confusion times complexity hurts. It is not surprising, however, that intelligence agencies notoriously try to collect, store, and analyze what the broad public willfully forgets. Significantly faster and cheaper random-access storage is the key driver for a paradigm shift towards remembering the sequence of modifications. We claim that (1) faster storage allows to efficiently and cleverly handle finer-granular modifi- cations and (2) that mandatory versioning elegantly exposes past state, radically simplifies the applications, and effectively lays a solid foundation for backing up, distributing and scaling of our data. This work shows, using the example of tree- structured XML, that the characteristics and advantages of the evolutionary ap- proach have been recognized and consistently implemented { something, which on its own is an important achievement. We present the concepts of our evolutionary tree-structured storage TreeTank and the general-purpose SlidingSnapshot to prove that (3) formerly modification- averse tree encodings can be maintained with logarithmic update complexity, (4) linear read scalability beyond memory limitations is still guaranteed while main- taining logarithmic update characteristics, (5) secure copy-on-write semantics can be extended from the file level to the much finer-granular node level, (6) versioned node-level access is predictable and even realtime-capable, and, that (7) node-level snapshots are as or even more space efficient than page-level or file-level snapshots. In the course of our work, we inspired the Java-based iSCSI implementation jSCSI which proved that (8) high-level language block access is fast and also established the Java benchmark framework PERFIDIX as well as the block touch visualization tool VISIDEFIX. We extend REST, the cornerstone interface of the web, with the ability to access the full version and modification history of a resource and call it (9) Temporal REST. This interface will not only encourage application developers to make use of our evolutionary approach, but it will also foster interactive and collaborative applications because they are, according to our claim (10), less complex to write and performing so well that users can now interactively work with large-scale data. Finally, we provide an outlook on how evolutionary (full-text) indices, applica- tions, and schemas can greatly leverage our contributions and how special-purpose hardware can speed-up our tree-structured storage while using far less energy. Es- pecially our suggested approach to schema handling and evolution has the potential to radically simplify ORM-based software development. ii Kurzfassung Nicht nur das Leben, sondern auch unsere Daten sind einer best¨andigenEvolution unterworfen, sei es in Forschung, Industrie oder im Privaten. Aus einer nat¨urlichen evolution¨aren Sicht entwickeln sich unsere Daten durch eine unabl¨assigeReihe fein- granularer Anderungen,¨ die unz¨ahligeVersionen hervorbringen. Aus einer tech- nischen anti-evolution¨aren Sicht, massgeblich durch technologische und finanzielle Einschr¨ankungenentstanden, betrachten wir die Anderungen¨ nur als vor¨ubergehend und speichern vorwiegend nur die letzte Version unserer Daten. Leider f¨uhrtdas Festhalten am g¨angigenAnsatz, die nat¨urliche Evolution zu ignorieren und die vergangenen Versionen bewusst zu vergessen, zu Verwirrung, Komplexit¨atund Geschwindigkeitseinbussen. Verwirrung, weil wir trotzdem immer wieder auf vergangene Versionen zugreifen m¨ussen. Komplexit¨at, weil wir wiederholt die M¨angelunseres Ansatzes ¨uberwinden m¨ussen. Geschwindigkeitseinbussen, weil Verwirrung gepaart mit Komplexit¨atwenig Erfolg verspricht. Interessanterweise versuchen Nachrichtendienste notorisch, genau die Daten zu sammeln, zu speichern und auszuwerten, die die breite Offentlichkeit¨ bewusst verwirft. Immer schnellere und g¨unstigereSpeicher sind der Haupttreiber f¨ureinen Wech- sel hin zum Nichtvergessen vergangener Anderungen.¨ Wir halten fest, dass (1) schnellere Speicher den effizienteren Umgang mit fein-granularen Anderungen¨ sowie (2) einen eleganteren Zugriff auf vergangene Versionen erm¨oglichen, die Anwendun- gen vereinfachen und eine solide Grundlage f¨urBackup und Verteilung unserer Daten legen. Die vorliegende Arbeit zeigt am Beispiel von XML, dass die Eigen- schaften und Vorteile des evolution¨aren Ansatzes erkannt und konsequent umgesetzt wurden { eine Tatsache, die f¨ursich allein eine wichtige Errungenschaft ist. Wir beweisen an Hand unseres evolution¨aren baumstrukturierten Speichers Tree- Tank sowie des universalen SlidingSnapshot, dass (3) vormals ¨anderungsaverse Kodierungen baumstrukturierter Daten in logarithmischer Zeit ge¨andertwerden k¨onnen,dass (4) die lineare Skalierbarkeit lesender Zugriffe bei gleichzeitig log- arithmischem Aufwand f¨ur Anderungen¨ sichergestellt bleibt, dass (5) das sichere Kopieren-beim-Schreiben von der Datei- auf die wesentlich feiner-granulare Knoten- Ebene angewendet werden kann, dass (6) der versionierte Zugriff auf Knoten-Ebene vorhersagbar und echtzeitf¨ahigist, und dass (7) Snapshots auf Knoten-Ebene max- imal so viel, oft weniger Platz ben¨otigen,wie Snapshots auf Datei- oder Seiten- Ebene. Wir haben zudem die Entwicklung einer Java-basierten iSCSI Implemen- tation namens jSCSI initiiert, an Hand derer gezeigt werden konnte, dass (8) Hochsprachen einen schnellen Zugriff auf Block-orientierte Speicher erm¨oglichen und haben zudem das Java Benchmark Framework PERFIDIX sowie das Tool VISIDEFIX zur Visualisierung von Block-Zugriffen etabliert. Wir erweitern REST, die Kern-Schnittstelle des Internet, (9) um den Zugriff auf die volle Versions- und Anderungshistorie¨ einer Ressource. Temporal REST wird interaktive Anwendungen befl¨ugeln,weil diese, dank unserer Schnittstellen- Erweiterung (10) weniger komplex und so performant in der Ausf¨uhrungsind, dass Benutzer interaktiv mit grossen Datenmengen arbeiten k¨onnen. Schliesslich zeigen wir auf, wie k¨unftigevolution¨are(Volltext-)Indizes, Anwen- dungen und Schemas von unseren Beitr¨agenprofitieren und wie spezialisierte, en- ergiesparende Hardware unseren baumstrukturierten Speicher beschleunigen kann. Insbesondere unsere Anregung zur Arbeit und Evolution an und von Schemas hat das Potential, die ORM-basierte Softwareentwicklung radikal zu vereinfachen. iii Acknowledgements I would like to express my sincere gratitude to my advisor, Prof. Dr. Marcel Waldvogel, for introducing me into the world of research and academia, both during my master and doctoral thesis. He allowed for a great amount of freedom to pursue my own visions and was at all times available with acute reviews and feedbacks. He was never tired to find new funding for my work and conference presentations. I'm really thankful that he sticked by me during my long lasting break due to family matters and even amicably warned me that writing a doctoral thesis while building up an own family and company would be hard { something which turned out to be so true. The whole distributed systems research and computing center group warmly wel- comed and sheltered me for almost five years. I received so much support and friend- liness from Sabine Dietrich, Sylvia Pietzko, Stephan Pietzko, Gerhard Schreiner, Michael L¨angle,Peter Degner, Andreas Kalkbrenner, J¨orgVreemann, and Dr. Ar- shad Islam, among others. I would also like to express my thankfulness to my second advisor, Prof. Dr. Marc Scholl for his valuable input and cooperation with his databases and information systems research group. I spent so many hours with Dr. Alexander Holupirek, Dr. Christian Gr¨un,and Dr. Stefan Klinger, discussing new ideas, visions, and how to write them down to convince academia that they are worthwile. Barbara L¨uthke also spent hours and hours trying to bring our English texts into a presentable form. In the course of my work, I had the great opportunity to collaborate, and extend my knowledge in different fields. Notably with Prof. Dr. Daniel A. Keim and Dr. Florian Mannsman in the field of data analysis and visualization, Universit¨at Konstanz, Prof. Dr. Torsten Grust and Dr. Jens Teuber in the field of query parsing and optimization, Technische Universit¨atM¨unchen, Prof. Dr. Sara I. Fabrikant