LANGUAGE-BASED ANOMALY DETECTION IN CLIENT-CLOUD INTERACTION

Doctoral Thesis to confer the academic degree of Doktor der technischen Wissenschaften in the Doctoral Program Technische Wissenschaften

Author: Harald Lampesberger, MSc

Submission: Christian Doppler Laboratory for Client-Centric Cloud Computing, Institute for Application Oriented Knowledge Processing

First Supervisor: Prof. Dr. Klaus-Dieter Schewe

Second Supervisor: Prof. Dr. Joachim Biskup

April 2016

JOHANNES KEPLER UNIVERSITY LINZ
Altenberger Str. 69
4040 Linz, Austria
www.jku.at
DVR 0093696

Sworn Declaration

I hereby declare under oath that the submitted Doctoral Thesis has been written solely by me without any third-party assistance, information other than provided sources or aids have not been used and those used have been fully documented. Sources for literal, paraphrased and cited quotes have been accurately credited. The submitted document here present is identical to the electronically submitted text document.

Linz, 20th April 2016 Harald Lampesberger

Abstract

For consuming a cloud service, clients and services need to communicate using a variety of languages and protocols. The Extensible Markup Language (XML) is the foundation of electronic data exchange in many existing and upcoming cloud standards and the subject of this thesis. XML-based protocols are usually declared in the industry standard XML Schema (XSD), and schema validation is a first line of defense against syntactically undesirable protocol messages. However, schemas are not enforced in some XML-based protocols and may not be available, and XSD best practices recommend extension points for loose composition. Schema extension points are in fact wildcards. They exist in many protocol specifications, and they break schema validation: an attacker can add arbitrary content to a document at an extension point without being rejected by schema validation. This is exploited by various attacks, e.g., the signature wrapping attack, and in recent years, several signature wrapping attacks were successfully executed against cloud management interfaces and identity providers. Syntactic schema validation can be effective against this attack but requires a language representation without extension points.

In this thesis, I propose a security monitor for language-based anomaly detection in XML-based interaction. The security monitor has a learner and a validator component. The learner infers an automaton from syntactically acceptable messages, and the validator utilizes the automaton for identifying syntactically non-acceptable messages in an interaction. Only learning from positive examples is considered because XML attacks are highly service specific, and violating examples are usually not available.

The first contribution is extending XML visibly pushdown automata (XVPAs) toward datatyped XVPAs (dXVPAs) for representing mixed-content XML in the learner. A dXVPA is translated to a character-data XVPA (cXVPA) for efficient stream validation in the validator. The second contribution is a lexical datatype system for generalizing text contents in mixed-content XML by inferring datatypes according to lexical subsumption and a preference heuristic. The third contribution is a set of algorithms for an incremental and set-driven learner. For dealing with poisoning attacks in realistic deployments, the learner has unlearning and sanitization capabilities for removing once-learned examples and trimming low-frequency states and transitions.

A prototype has been experimentally evaluated in two synthetic and two simulated scenarios. Realistic data was generated from an Apache Axis2 web service by simulating state-of-the-art XML attacks. The proof of concept showed promising results: the learner only needed a few examples to converge to a stable language representation. Detection rates for all datasets were between 82.35% and 100% without false positives and outperformed traditional schema validation. Poisoning attacks were also successfully removed by unlearning and sanitization. Use cases for integrating the proposed security monitor are, e.g., a middleware security component, an anomaly detection component in XML firewalls, and a client browser plug-in for filtering XML-based resources.

Kurzfassung

Für die Nutzung eines Cloud-Services müssen Klienten und Services mit einer Vielfalt an Sprachen und Protokollen kommunizieren. Die Extensible Markup Language (XML) nimmt hier eine spezielle Rolle ein, da viele aktuelle und zukünftige Standards im Cloud Computing auf XML-basierten Protokollen aufbauen. XML steht deshalb im Zentrum dieser Arbeit. XML-basierte Protokolle sind typischerweise im Industriestandard XML-Schema (XSD) spezifiziert. Schema-Validierung ist daher eine erste Instanz, um syntaktisch unerwünschte Protokollnachrichten zu filtern. Schemas sind in manchen Protokollen jedoch nicht zwingend erforderlich und daher möglicherweise nicht verfügbar, und sogenannte Erweiterungspunkte haben sich in XSD für die lose Komposition von Schemas durchgesetzt. Schema-Erweiterungspunkte sind Platzhalter, die in vielen Protokollspezifikationen auftreten und letztendlich Schema-Validierung aushebeln. Konkret kann ein Angreifer trotz Validierung beliebigen Inhalt in einem XML-Dokument platzieren. Das kann für verschiedene Angriffe ausgenützt werden, z. B. für einen Signature-Wrapping-Angriff. In den letzten Jahren wurden mehrere erfolgreiche Signature-Wrapping-Angriffe auf Cloud-Management-Schnittstellen und Identitätsprovider demonstriert. Syntaktische Validierung von Dokumenten könnte den Signature-Wrapping-Angriff verhindern, wenn keine Erweiterungspunkte in der Sprachrepräsentation sind.

Diese Dissertation beschreibt einen Sicherheitsmonitor für sprachbasierte Anomalieerkennung in XML-basierter Interaktion. Der Sicherheitsmonitor hat eine Lern- und eine Validierungskomponente. Die Lernkomponente erlernt einen Automaten aus syntaktisch akzeptablen Protokollnachrichten, und die Validierungskomponente benützt diesen Automaten, um syntaktisch nicht-akzeptable Protokollnachrichten in der Interaktion zu identifizieren. Das Lernverfahren konzentriert sich ausschließlich auf Positivbeispiele, da XML-Angriffe sehr servicespezifisch sind und dadurch die Verfügbarkeit von Gegenbeispielen üblicherweise nicht gegeben ist.

Der erste Beitrag umfasst Erweiterungen des XML Visibly Pushdown Automaten (XVPA) mittels Datentypen (dXVPAs) als Sprachdarstellung von mixed-content XML in der Lernkomponente. Ein dXVPA lässt sich wiederum in einen character-data XVPA (cXVPA) übersetzen, welcher für effiziente Stream-Validierung in der Validierungskomponente herangezogen wird. Der zweite Beitrag ist ein lexikalisches Datentypensystem für das Generalisieren von Textinhalten in mixed-content XML durch Datentypen. Passende Datentypen für einen Textinhalt werden durch lexikalische Subsumtion und eine Präferenzheuristik ermittelt. Der dritte Beitrag umfasst Algorithmen für einen schrittweisen und mengenbasierten Lerner. Um den praktischen Umgang mit Poisoning-Angriffen in realistischen Umgebungen zu erleichtern, hat der Lerner zusätzliche Fähigkeiten: Er kann bereits gelernte Beispiele wieder vergessen, und durch das Entfernen von wenig frequentierten Zuständen und Zustandsübergängen können versteckte Poisoning-Angriffe bereinigt werden.

Ein Softwareprototyp wurde experimentell in zwei synthetischen und zwei simulierten Szenarios evaluiert. Mithilfe eines Apache-Axis2-Webservices und durch Simulation von aktuellen XML-Angriffen wurden realistische Daten erzeugt. Die Experimente zeigten vielversprechende Ergebnisse. Der Lerner benötigte in allen Szenarien nur wenige Beispiele für die Konvergenz zu einer stabilen Sprachrepräsentation. Des Weiteren waren die Erkennungsraten in allen Szenarien zwischen 82,35% und 100%, frei von Falschalarmen und übertrafen traditionelle Schema-Validierung. Poisoning-Angriffe konnten durch gezieltes Vergessen und Bereinigen erfolgreich entfernt werden.

Anwendungsfälle für die Integration des vorgeschlagenen Sicherheitsmonitors wären eine Middleware-Sicherheitskomponente, eine Anomalieerkennungskomponente in einer XML-Firewall und ein klientenseitiges Browser-Plug-in für die Analyse von XML-basierten Webressourcen.

Acknowledgments

First and foremost, I wish to thank my advisor Prof. Dr. Klaus-Dieter Schewe for his excellent guidance, continuous support, and patience over the last four years. He has taught me the rigorous way, sparked my interest in theoretical computer science, and given me the scientific freedom to pursue my ideas. Becoming a scientist under your supervision was a great experience, and I will forever be thankful to you.

I would also like to express my gratitude to Prof. Dr. Joachim Biskup, who kindly agreed to act as a co-advisor and to examine my work. Thank you for having given me the opportunity to discuss my research at TU Dortmund and for the invaluable feedback at an important stage of my dissertation project.

Many thanks also go to fellow labmates and friends, Dr. Károly Bósa, Andreea Buga, Ursula Haiberger, Roxana Holom, Tania Nemes, Mariam Rady, Mircea Boris Vleju, and Ciprian Zăvoianu. Thank you for all the fruitful discussions, the funny moments, the nice atmosphere in the laboratory, and all the mental support while writing this thesis. Furthermore, I would like to thank my friends Matthias Pfötscher, Florian Wex, and Philipp Winter for the inspiring discussions in- and outside academia.

Lastly, and most importantly, I send my sincerest gratitude to Verena, for all your love and support throughout my doctoral studies. My deepest gratitude goes to my loving parents Maria and Alois, who enabled me to pursue an academic education and helped me through all the years, and to my sisters Bettina, Barbara, Karin, and Andrea for the moral support. This work is dedicated to you.

Contents

Sworn Declaration
Abstract
Kurzfassung
Acknowledgments
Abbreviations

1 Introduction
  1.1 Motivation
    1.1.1 Language-Theoretic Security
    1.1.2 XML Attacks and Schema Validation
  1.2 Objectives
  1.3 Outline
  1.4 Contributions

2 Background: Interaction and Monitoring
  2.1 Client-Cloud Interaction
    2.1.1 Historical Context
  2.2 Languages for Content and Media
    2.2.1 Data Serialization Formats
    2.2.2 Container Formats
  2.3 Service Interaction Patterns
    2.3.1 Single-Transmission Bilateral Patterns
    2.3.2 Single-Transmission Multilateral Patterns
    2.3.3 Multi-Transmission Patterns
    2.3.4 Routing Patterns
  2.4 Protocols for Computer Networks
    2.4.1 Transport Layer Protocols
    2.4.2 Application Layer Protocols
  2.5 Cloud Interaction Architectures
    2.5.1 Web Architectures
    2.5.2 Service Architectures
    2.5.3 Cloud Computing Aspects
  2.6 Monitoring of Client-Cloud Interaction
    2.6.1 Cloud Monitoring
    2.6.2 Observation Points
  2.7 Implications of Technological Trends

3 Literature Review
  3.1 Anomaly-Based Intrusion Detection
    3.1.1 Network Layer Anomaly Detection
    3.1.2 Operating System Layer Anomaly Detection
    3.1.3 Service, Middleware, and User Layer Anomaly Detection
    3.1.4 Language-Theoretic View on Intrusion Detection
    3.1.5 Intrusion Detection Evasion
    3.1.6 Kernel-Based Anomaly Detection in Tree-Structured Data
    3.1.7 XML Anomaly Detection
  3.2 The Extensible Markup Language
    3.2.1 Namespaces
    3.2.2 Parsing
    3.2.3 Schema Languages
    3.2.4 XPath
    3.2.5 Syntactic Optimizations
  3.3 XML Attacks
    3.3.1 Parsing Attacks
    3.3.2 Semantic Attacks
    3.3.3 Schema Extensibility and Effects on Validation
    3.3.4 XML Signature Wrapping Attack
    3.3.5 Language-Theoretic View on XML Attacks
  3.4 Stream Validation
    3.4.1 Visibly Pushdown Automata
    3.4.2 XML Visibly Pushdown Automata
    3.4.3 VPAs in XML Research
  3.5 Schema Inference
    3.5.1 DTD Inference
    3.5.2 XSD Inference
  3.6 Wrapper Inference for Information Extraction

4 Language-Based Anomaly Detection
  4.1 Problem Description
    4.1.1 Architecture and Assumptions
    4.1.2 XML Stream Validation
    4.1.3 Grammatical Inference
    4.1.4 Applicability
  4.2 Methodology
    4.2.1 Datatyped Language Representation
    4.2.2 A Lexical Datatype System for Datatype Inference
    4.2.3 Learning from Positive Examples
    4.2.4 Anomaly Detection Refinements
    4.2.5 Experimental Evaluation
  4.3 Use Cases

5 Grammatical Inference of XML
  5.1 Document Event Stream Representation
  5.2 Models for Datatypes and Text Content
    5.2.1 Datatyped EDTD
    5.2.2 Datatyped XVPA
    5.2.3 Character-Data XVPA for Stream Validation
  5.3 A Lexical Datatype System for Datatype Inference
    5.3.1 Lexical Subsumption
    5.3.2 Linear-Time Lexical Subsumption Computation
    5.3.3 Preference Heuristic
    5.3.4 Datatyped Event Stream for Learning
    5.3.5 Combining Datatypes and Incremental Update
  5.4 Learning from Positive Examples
    5.4.1 The k-Testable Regular Languages for Element Locality
    5.4.2 Schema Typing Mechanisms for Type Contexts
    5.4.3 Visibly Pushdown Prefix Acceptor
    5.4.4 State Merging for Generalization
    5.4.5 Generating dXVPAs
    5.4.6 Parameter Semantics
  5.5 Learner Properties
    5.5.1 Computational Complexity
  5.6 Anomaly Detection Refinements
    5.6.1 Event Stream Security Heuristics
    5.6.2 Filtering
    5.6.3 Unlearning and Evolution
    5.6.4 Sanitization

6 Experimental Evaluation
  6.1 Implementation
  6.2 Measures
    6.2.1 Detection Performance
    6.2.2 Learning Progress
    6.2.3 Operational Performance
  6.3 Evaluation Scenarios
    6.3.1 Synthetic Datasets
    6.3.2 Simulated Datasets
  6.4 Performance Results
    6.4.1 Detection Performance
    6.4.2 Learning Progress
    6.4.3 Operational Performance
    6.4.4 Unlearning
    6.4.5 Sanitization

7 Conclusions
  7.1 Summary
  7.2 Objectives
  7.3 Assumptions, Restrictions, and Design Choices
  7.4 Discussion of Results
  7.5 Open Questions
    7.5.1 Extended and Comparative Evaluation
    7.5.2 Modeling and Learning Improvements
    7.5.3 Query Learning
    7.5.4 Architectural Aspects

A XML Schema Datatype Hierarchy
B Additional Experiments
Bibliography
Curriculum Vitae

List of Figures

1.1 A client and a service exchange messages over some transport mechanism in an agreed communication protocol for interaction
1.2 LangSec threat model
1.3 Relationships of the relevant sets of inputs and their approximations by language-based intrusion detection methods (reproduced from [58, p. 357])
1.4 Classical XML signature wrapping attack [393]

2.1 Concepts and relationships identified in communication technologies [226]
2.2 Service interaction patterns for messaging between parties [37, 38]
2.3 The Internet model [67] and client-cloud communication protocols
2.4 A URL identifies and locates a resource
2.5 Web- and service-oriented view on cloud interaction architectures [226]
2.6 Publish-subscribe architectures for multilateral interaction
2.7 Architectures for message-oriented middleware [104]
2.8 High-level components of a monitoring system (based on [58, p. 359])

3.1 Anomaly detection errors in a binary classification setting with respect to the relevant sets of behaviors, states, or inputs (reproduced from [58, p. 357])
3.2 Vector space embeddings of a tree [355, 356]
3.3 Global and local anomaly detection in feature space [359]
3.4 XML document and its tree representation
3.5 XML parsing steps
3.6 Schema extension points in the SOAP XSD schema
3.7 Informal grammar of XML Signature taken from the specification [453]
3.8 The three types of XML Signature [453]
3.9 XPathFilter2 expression for referencing elements in XML Signature
3.10 Examples for XML signatures in SAML assertions
3.11 An exemplary XVPA and an accepted document event stream
3.12 PTA constructed from examples {ac, abbcd, abc, acddd, abbbcd} and annotated by frequencies
3.13 Lexical-subsumption datatype inference reproduced from related work

4.1 Proposed language-based anomaly detection architecture
4.2 Value and lexical spaces of XSD datatypes
4.3 Learning from positive examples
4.4 The proposed learning process
4.5 CCIM architecture and components

5.1 Symbolic StAX event stream
5.2 A dXVPA accepting the same documents as the dEDTD in Example 4
5.3 Ordering ≤lex on lexically distinct XSD datatypes
5.4 Ordering ≤s on kinds of lexical datatypes
5.5 Symbolic datatyped event stream
5.6 PTA and corresponding 2-testable DFA
5.7 Typing of an element
5.8 VPPA construction examples
5.9 State merging with parameters k = 1, l = 2 for the example in Figure 5.8d
5.10 Generated dXVPA from the VPA in Figure 5.9

6.1 Two dXVPAs inferred from dataset Carsale
6.2 Learning progress highlights
6.3 Effects of increased k and l for dataset VulnShopOrder
6.4 Effects of multiple poisoning attempts and delayed unlearning
6.5 Effects of a single poisoning attempt and later sanitization

7.1 Learning from positive examples and queries

A.1 XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes [463]

B.1 Additional unlearning and sanitization experiments
B.2 Carsale learning progress using ancestor-based states
B.3 Carsale learning progress using ancestor-sibling-based states
B.4 Catalog learning progress using ancestor-based states
B.5 Catalog learning progress using ancestor-sibling-based states
B.6 VulnShopOrder learning progress using ancestor-based states
B.7 VulnShopOrder learning progress using ancestor-sibling-based states
B.8 VulnShopAuthOrder learning progress using ancestor-based states
B.9 VulnShopAuthOrder learning progress using ancestor-sibling-based states

List of Tables

2.1 Interaction patterns in communication protocols [226]
2.2 Interaction patterns in web architectures [226]
2.3 Interaction patterns in service architectures [226]

6.1 The binary confusion matrix for classification
6.2 Properties of the evaluation datasets
6.3 Baseline detection performance using schema validation
6.4 Detection performance highlights

Abbreviations

1PPT 1-Pass Preorder Typing

AJAX Asynchronous JavaScript and XML

AMQP Advanced Message Queuing Protocol

API Application programming interface

ASCII American Standard Code for Information Interchange

ASM Abstract State Machine

ASN.1 Abstract Syntax Notation One

BOSH Bidirectional-streams over synchronous HTTP

BPEL Business Process Execution Language

BPMN Business Process Model and Notation

CCIM Client-Cloud Interaction Middleware

CDATA Unparsed character data

CHARE Chain regular expression

CoAP Constrained Application Protocol

COM+ Microsoft Component Services

CORBA Common Object Request Broker Architecture

CORS Cross-origin resource sharing

CSP Content Security Policy

CSS Cascading Style Sheets

cXVPA Character-data XVPA

DCCP Datagram Congestion Control Protocol

DCFL Deterministic context-free language

DDS Data Distribution Service for Real-Time Systems

DDoS Distributed Denial of Service

dEDTD Datatyped EDTD

DFA Deterministic finite automaton

DIME Direct Internet Message Encapsulation

DNS Domain Name System

DOM Document Object Model

DPI Deep Packet Inspection

DTD Document Type Definition

DTLS Datagram Transport Layer Security

dXVPA Datatyped XVPA

ECFG Extended context-free grammar

EDC Element Declarations Consistent

EDSM Evidence-driven state merging

EDTD Extended Document Type Definition

EDTDst Single-type EDTD

EDTDrc Restrained-competition EDTD

FTP File Transfer Protocol

FQDN Fully qualified domain name

HTML Hypertext Markup Language

HTTP Hypertext Transfer Protocol

HTTPS HTTP Secure

IaaS Infrastructure-as-a-Service

IANA Internet Assigned Numbers Authority

IDL Interface definition language

IDPS Intrusion detection and prevention system

IDREF Identifier-based reference

IDS Intrusion detection system

IIOP Internet Inter-ORB Protocol

i-ORE i-occurrence regular expression

IP Internet Protocol

IPS Intrusion prevention system

IPv6 Internet Protocol Version 6

JSON JavaScript Object Notation

LangSec Language-Theoretic Security

MDL Minimum description length

MIME Multipurpose Internet Mail Extensions

MML Minimum message length

MOM Message-oriented middleware

MPTCP MultiPath TCP

MQTT MQ Telemetry Transport

MSMQ Microsoft Message Queue

MTOM Message Transmission Optimization Mechanism

NFA Nondeterministic finite automaton

OWL Web Ontology Language

PaaS Platform-as-a-Service

PCDATA Parsed character data

PTA Prefix tree acceptor

QUIC Quick UDP Internet Connections

REG The class of regular languages

RDF Resource Description Framework

REST Representational State Transfer

RPC Remote procedure call

RPNI Regular Positive and Negative Inference

RSS Rich Site Summary

RTPS Real-Time Publish-Subscribe

RUDP Reliable UDP

SaaS Software-as-a-Service

SAML Security Assertion Markup Language

SAX Simple API for XML

SCTP Stream Control Transmission Protocol

SIEM Security information and event management

SGML Standard Generalized Markup Language

SMTP Simple Mail Transfer Protocol

SNMP Simple Network Management Protocol

SOAP Simple Object Access Protocol

SOCKS Sockets Secure

SORE Single-occurrence regular expression

SOXSD Single-Occurrence XSD

SQL Structured Query Language

SSE Server-Sent Events

SSL Secure Sockets Layer

SSRF Server-side request forgery

StAX Streaming API for XML

STOMP Streaming Text Oriented Messaging Protocol

SVDD Support Vector Data Description

SVM Support Vector Machine

TCP Transmission Control Protocol

TLS Transport Layer Security

UDDI Universal Description, Discovery, and Integration

UDP User Datagram Protocol

UPA Unique Particle Attribution

URI Uniform resource identifier

URL Uniform resource locator

VPA Visibly pushdown automaton

VPL Visibly pushdown language

VPPA Visibly pushdown prefix acceptor

VTD Virtual Token Descriptor

W3C World Wide Web Consortium

WADL Web Application Description Language

WebRTC Web Real-Time Communications

WSDL Web Service Description Language

WS-* Web service standards and specifications

XDR External Data Representation

XML Extensible Markup Language

XMPP Extensible Messaging and Presence Protocol

XOP XML-binary Optimized Packing

XPath XML Path Language

XRDL XML-RPC Description Language

XSD XML Schema

XSLT Extensible Stylesheet Language Transformations

XSRF Cross-site request forgery

XSS Cross-site scripting

XVPA XML visibly pushdown automaton

Chapter 1

Introduction

Cloud computing [32] is a paradigm for providing computational resources, e.g., hardware, storage, and software, as an Internet-accessible service to consumers. A pay-per-use business model and a highly automated multi-tenant architecture on the provider side are key characteristics of a cloud, so a client can provision resources on demand from a virtually unlimited pool. This so-called elasticity enables a client to rapidly deploy services and to scale resource usage. The National Institute of Standards and Technology [256] has defined widely accepted deployment and service delivery models to categorize clouds. Deployment models include private, community, public, and hybrid clouds, and typical service delivery models are Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS).

Interaction between distributed service providers and consumers requires communication over a computer network, i.e., the Internet, as illustrated in Figure 1.1. A service is an IT resource that is made accessible through an interface by some system in the cloud [166, 336]. A service consumer system on the client side, i.e., a client, is typically operated by a user for accessing a service. Client and service are considered heterogeneous systems that participate in an architecture and exchange messages in an agreed communication protocol. For service-to-service interaction in a composition, a service participates as a client to consume other services. A service can also coordinate two clients for client-to-client or peer-to-peer interaction. On a technological level, cloud computing benefits from well-established web and enterprise service technologies [226].

Figure 1.1: A client and a service exchange messages over some transport mechanism in an agreed communication protocol for interaction

However, by consuming a cloud service, a client becomes dependent on the service provider. As cloud computing accumulates data on the provider side, clouds and their clients become attractive targets for attackers. In particular, the Cloud Security Alliance [93] has identified nine security threats to cloud computing: data breaches, where an attacker gains access to client data; tampering or loss of data; hijacking of accounts or service communication; insecure application programming interfaces (APIs) for service access, management, and orchestration; Denial of Service; malicious insiders; abuse of cloud services for attacks; insufficient due diligence; and shared technology vulnerabilities in multi-tenant architectures. The increasing need for cloud security coincides with a strong momentum in cloud security research, as pointed out by Fernandes et al. [123].

Monitoring and Intrusion Detection

One way of raising assurance that security properties such as availability, integrity, and confidentiality are met is monitoring [58, ch. 11]. In particular, intrusion detection is a variant of monitoring for identifying attacks in a computer system or network [378]. An intrusion detection system (IDS) assumes that evidence or symptoms are observable in case of an attack, and detection methods are distinguished into misuse- and anomaly-based methods. A misuse-based IDS monitors for known patterns of “undesirable accesses”, i.e., so-called attack signatures. An anomaly-based IDS, on the other hand, assumes that an attack causes anomalies with respect to a reference model of “acceptable usages”, which is derived from a specification or learned from observations.

Misuse-based detection identifies known attacks but fails when an attack is unknown (i.e., a zero-day attack). Bilge and Dumitras [55] have studied attacks in malicious software (i.e., malware) between 2008 and 2011, and their results indicate that zero-day attacks are more frequent than previously thought. Furthermore, polymorphism and obfuscation can customize attacks for evading signatures [396]. Anomaly detection can, in theory, identify zero-day and obfuscated attacks, but learning-based approaches suffer from inherent problems, especially the cost of false alarms. The distribution of normal observations and attacks is unknown and could be heavily skewed; even an anomaly-based IDS with a very low false-alarm rate could still generate an unacceptable number of false alarms [34]. Also, the presence of an anomaly does not always imply an attack, training data for learning-based approaches could be contaminated with zero-day attacks, and the notions of “acceptable usages” and “undesirable accesses” of the system under observation can evolve over time [150, 392].

Language-Based Anomaly Detection in XML

This thesis addresses language-centric monitoring of messages received at insecure client or service interfaces. The focus is on the Extensible Markup Language (XML) [451] because it is a platform-independent data serialization format used in many cloud communication protocols, e.g., the Simple Object Access Protocol (SOAP) [8, 447] for traditional web services, the Extensible Messaging and Presence Protocol (XMPP) [186, 369] for message-oriented middleware in clouds, the Security Assertion Markup Language (SAML) [304] for cloud single sign-on access control, and as a data serialization format in Representational State Transfer (REST) [129, 130] web service architectures. XML is a complex language, and XML-based protocols are susceptible to entire classes of implicit and explicit security problems [202, 227].

XML-based protocols are typically specified in the industry-standard XML Schema (XSD) [443], and like a grammar, a schema characterizes a set of XML documents. Schema validation then checks whether a document is generated by a particular schema. Validation at the client or service interface should be a first line of defense against “undesirable protocol messages”. But XML and XSD suffer from two fundamental problems:


• XML is usually treated as a tree-based format, but in many protocols this is not the case because of integrity constraints. For example, ID-based references in a document entail a graph structure, and processing and validating a document correctly become computationally harder.

• Best practices in XSD recommend extension points in schemas for loose composition [400]. But extension points enable an attacker to add arbitrary unchecked content in a document, e.g., for a signature wrapping attack [255] (a schema-validation sketch after this list illustrates the effect).
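To make this concrete, the following sketch (in Java, using the standard javax.xml.validation API) validates a document against a hypothetical toy schema whose content model ends in an xs:any extension point; neither the schema, the document, nor the element names are taken from the thesis. The injected CreateKeyPair element is matched by the wildcard and, with processContents="lax" and no matching declaration, simply skipped, so the document still validates.

    import java.io.StringReader;
    import javax.xml.XMLConstants;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    public class ExtensionPointDemo {
        // Hypothetical toy schema: a required item element followed by an xs:any extension point.
        static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='order'>"
          + "<xs:complexType><xs:sequence>"
          + "<xs:element name='item' type='xs:string'/>"
          + "<xs:any processContents='lax' minOccurs='0' maxOccurs='unbounded'/>"  // extension point
          + "</xs:sequence></xs:complexType>"
          + "</xs:element>"
          + "</xs:schema>";

        // Document with an injected, undeclared element at the extension point.
        static final String DOC = "<order><item>book</item><CreateKeyPair/></order>";

        public static void main(String[] args) throws Exception {
            Validator v = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)))
                    .newValidator();
            v.validate(new StreamSource(new StringReader(DOC)));  // no exception: the document is "valid"
            System.out.println("accepted despite the injected content");
        }
    }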

An XML attack manifests in a document's structure or in the text contents of elements. I propose a language-based anomaly detection approach in terms of an interface-centric security monitor on the client or service side. The contributions are XML language representations for the learner and validator components in the security monitor, algorithms for incremental and set-driven learning of syntactically acceptable messages, and an experimental evaluation of the proposed monitor in four scenarios.

For representing sets of documents, XML visibly pushdown automata (XVPAs) [10, 223] are extended toward datatyped XVPAs (dXVPAs) for mixed-content XML, where text content is allowed between subsequent tags in a document. A dXVPA translates into a character-data XVPA (cXVPA) for efficient stream validation. For learning, all text contents in a document are generalized by minimally required datatypes, and datatypes are inferred by a proposed lexical datatype system based on lexical subsumption and a preference heuristic.

XML attacks that undermine semantics, e.g., signature wrapping attacks, are highly client or service specific, and “undesirable examples” are considered unavailable for learning. Therefore, as a first step, the assumed learning scenario is learning from positive examples, where an incremental learner infers a dXVPA for characterizing the “acceptable messages” at a particular client or service interface. A dXVPA is then translated to a cXVPA for efficient validation of observed messages in the proposed security monitor. To minimize errors, i.e., false positives and negatives, this language-based anomaly detection approach respects the language properties of XML by inferring a good-enough approximation of the acceptable language.

Poisoning attacks in training data are a threat to learning-based approaches. Therefore, operations for unlearning and sanitization are provided. Unlearning removes a once-learned example from a dXVPA, e.g., when a poisoning attack is uncovered at a later time, and sanitization trims low-frequency states and transitions from a dXVPA for removing hidden poisoning attacks under the assumption that poisoning attacks are rare.

For a proof of concept, the proposed algorithms have been experimentally evaluated in two different settings: a synthetic scenario, where XML is used as a data serialization format, and a simulated scenario, where state-of-the-art attacks are executed on an actual web service that was implemented solely for testing purposes.
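The division of labor between the two components can be summarized as a pair of interfaces. The following Java sketch is purely illustrative: the type and method names are hypothetical and do not reflect the prototype's actual API, but they mirror the operations described above (incremental learning, unlearning, sanitization, and translation of a dXVPA into a cXVPA for stream validation).

    import javax.xml.stream.XMLEventReader;

    // Hypothetical component interfaces of the proposed security monitor.
    interface Learner {
        void learn(XMLEventReader acceptableDocument);    // incrementally update the dXVPA
        void unlearn(XMLEventReader poisonedDocument);     // remove a once-learned example
        void sanitize(long frequencyThreshold);            // trim low-frequency states and transitions
        Validator snapshot();                              // translate the current dXVPA into a cXVPA
    }

    interface Validator {
        boolean accepts(XMLEventReader observedDocument);  // stream validation against the cXVPA
    }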

1.1 Motivation

My thesis is motivated by the state of software security. Despite increased efforts in securing systems and program code, software is still vulnerable, and new threats arise every day [91]. Users expect exploitable vulnerabilities to be removed or mitigated.


1.1.1 Language-Theoretic Security

Language-theoretic security (LangSec) is an attempt to understand the root causes of software insecurity [69, 70, 376]. More specifically, LangSec is a design and programming philosophy that focuses on formally correct and verifiable input handling throughout all phases of the software development process [331]. Special emphasis is given to secure parsing because parsing happens in practically every program that manipulates data, and the majority of attacks exploit weaknesses in input handling [68].

Figure 1.2 shows the LangSec threat model, and the illustrated entities are defined in accordance with Biskup [58, ch. 11]. The crossed-out values denote a traditional view: a computing system executes a program that consumes input data and exhibits behavior in the process. A computing system could be software or hardware, but this is not further specified; it is assumed that the possible states and the possible state transitions abstractly define the computing system. For simplicity, the resulting behavior from consuming a particular input is a sequence of state changes, i.e., a trace. A security policy defines “acceptable usages” and “undesirable accesses” in terms of acceptable and violating behavior, i.e., acceptable and violating traces and states. A successful attack then causes violating behavior.


Figure 1.2: LangSec threat model

However, this simplified view has shortcomings in reality, e.g., a security policy might be incomplete or flawed, the declaration language for policies usually has limited expressiveness to stay decidable, there could be errors and flaws in the program that introduce unknown states and state transitions, and reasoning about non-trivial properties of the program is undecidable in the general case [58, 70, 68]. LangSec embraces an attacker's view: a computing system parses input, parsing drives the behavior, and an attacker wants to cause violating behavior. In Figure 1.2, the computing system is therefore considered to be an interpreter, and input data is treated as a program for the interpreter. A so-called exploit is then specially crafted input that leads execution toward an insecure trace or state.

Accepting good input and rejecting bad input is a MEMBERSHIP decision problem in the LangSec view, and decidability depends on the input language class. Input languages and compositions thereof, e.g., layered protocol design and character encodings, are in fact the foundation for all communication protocols and data formats today. Unfortunately, ad hoc notions of input validity, ambiguous interpretations of informal protocol specifications, mixing of input recognition and logical processing steps in program code, and ungoverned software development (e.g., by adding new features to protocols) can lead to accidentally Turing-complete input languages in the worst case [68, 331].

The LangSec community argues that these accidental increases in input language expressiveness are the root cause of today's software vulnerabilities. Deciding MEMBERSHIP can become intractable or undecidable with increasing accidental expressiveness, and a computing system would not be able to reject an exploit even when a specification of the acceptable inputs is available [69, 376]. The expressiveness of the input language therefore needs to be understood and controlled.


Classifications of Inputs

The “undesirable accesses” and “acceptable usage” are semantic notions for a computing system and must be properly defined. The following classifications are based on Biskup [58, ch. 11]. A security policy syntactically describes semantically acceptable and violating behaviors of a computing system. But a policy declaration language usually has expressiveness and decidability restrictions, and only a minor fraction of syntactically permitted behaviors can be considered semantically acceptable [58, p. 358].

The LangSec philosophy embraces a syntactic characterization of relevant sets of inputs with respect to semantically acceptable and violating behaviors. Figure 1.3 illustrates the relationships between the relevant sets.

• The possible inputs are all the inputs that can be sent to the computing system.

• The syntactically acceptable inputs are exactly the inputs that result in semantically acceptable behavior when consumed by a computing system.

• Explicitly permitted inputs are syntactically correct inputs with respect to the input language specification in a computing system's parser. These inputs can be consumed, and this set strictly contains the acceptable inputs. Explicitly permitted inputs do not necessarily result in semantically acceptable behavior.

• Exploits (attacks) are syntactically violating inputs that cause semantically violating behavior, and a computing system should definitely reject those.

Figure 1.3: Relationships of the relevant sets of inputs and their approximations by language-based intrusion detection methods (reproduced from [58, p. 357])

Language-Theoretic Security and Intrusion Detection

The LangSec view explains some of the problems encountered in intrusion detection when language-based methods are used. Figure 1.3 also visualizes the relationship between intrusion detection methods and relevant sets of inputs. On the one hand, the violating inputs are usually unknown, and security experts discover exploits over time. On the other hand, the acceptable inputs are often also not exactly known, e.g., when the input language is composed from several protocols or defined ad hoc in program code.


An IDS that observes inputs has to classify acceptable and violating inputs efficiently. Algorithms are usually engineered for a tractable language class, and models of acceptable and violating inputs are therefore approximations. While anomaly detection approximates the acceptable inputs from a specification or by learning, misuse detection approximates violating inputs from discovered exploits and generalizations thereof, e.g., in terms of patterns. The consequences of approximation are however false-positive and false-negative detection errors. Two problems emerge in the worst case when the language classes of an IDS and observable inputs are not equally expressive [376]:

• Some exploits could universally evade detection by a misuse-based IDS, e.g., by obfuscation or polymorphism.

• An anomaly-based IDS might not be able to learn the acceptable inputs even after exhaustive training.

Respecting the language class of inputs is therefore a precondition for accurate intrusion detection, but approximations are inevitable in some cases. In this thesis, the focus is on XML because of its role in cloud computing. XML is a text-based format, where well-matched tags syntactically encode a tree of elements, but the semantic data model of a document is not necessarily a tree because of integrity constraints. These constraints are specified in schemas, and they assign special semantics to attributes, elements, and text contents for allowing more expressive data models. However, there is little syntactic evidence in documents whether an attribute, element, or text content is actually part of an integrity constraint. The focus in this thesis is therefore on the syntactic structure of elements and text contents only. Key mining in XML is a research field on its own [29].

The proposed language-based anomaly detection approach learns a good-enough approximation of the acceptable documents. Language representation by dXVPAs and cXVPAs is beneficial for efficient stream processing in a monitor but fundamentally an approximation because integrity constraints are ignored, and documents are simply treated as well-matched event streams (a sketch of such an event stream follows the list below). This approximation is still good enough for identifying the majority of XML attacks:

• An inferred dXVPA is free of extension points, and an attacker cannot place arbitrary elements in a document anymore.

• When a document syntactically differs from acceptable training examples, e.g., because of violating element structures and text contents, it is rejected.
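As a minimal illustration of what treating a document as a well-matched event stream means, the following Java snippet uses the standard StAX API to consume a small, made-up document as a sequence of open-tag, text, and close-tag events. In visibly-pushdown terms, open and close tags roughly play the role of call and return symbols, and the text events are what the learner later generalizes by datatypes; the document and element names are only examples, not data from the thesis.

    import java.io.StringReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class EventStreamDemo {
        public static void main(String[] args) throws Exception {
            String doc = "<order><item>book</item><qty>2</qty></order>";  // made-up example document
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(doc));
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        System.out.println("open  <" + r.getLocalName() + ">");   // call symbol
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        System.out.println("text  \"" + r.getText() + "\"");      // text content, later generalized by datatypes
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        System.out.println("close </" + r.getLocalName() + ">");  // return symbol
                        break;
                }
            }
        }
    }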

1.1.2 XML Attacks and Schema Validation

XML provides several schema languages, e.g., Document Type Definition (DTD) [451], XSD [443], and Relax NG [292], to specify sets of documents. A schema is like a grammar and defines a language, and a document is valid if all schema production rules are satisfied. Schema validation is therefore a first line of defense against document structures and text contents that are not explicitly permitted.

However, two observations motivate a learning-based approach. First, RESTful web services are popular for SaaS implementations, but they enforce neither the presence of a schema nor the validation of an XML-based resource. The input language is then effectively defined ad hoc in the program code. Second, extension points in XSD

are in fact wildcards, and they are found in practically all protocol specifications today, e.g., SOAP, XMPP, and SAML. When extension points are present, schema validation is rendered ineffective as an attack countermeasure because an element at an extension point is simply skipped. Various parsing and semantic XML attacks can therefore be placed while the document stays explicitly permitted. Moreover, checking integrity constraints is costly because they change the data model of a document. In particular, sequential and cyclic references can turn the data model into an infinite tree, and checks and operations become computationally harder [489].

The XML signature wrapping attack [255] is a consequence of these problems. XML Signature [453] specifies a cryptographic signature for the text representations of one or more elements in a document. A signature is stored in its designated element, and as shown in Figure 1.4a, a reference points to the signed part. The vulnerability is basically a discrepancy between the cryptographic verification and the business logic in the receiving system. Figure 1.4b illustrates an example of an attacker's message. As a prerequisite, the attacker needs to have access to a message already signed by another user, e.g., from network sniffing or crawling cloud service provider support forums [393].

Figure 1.4: Classical XML signature wrapping attack [393]. (a) Original XML message; (b) modified XML message

For the attack, the signed element is moved into a wrapper element at an extension point to evade schema validation, e.g., the Wrapper element in the figure, and the attacker places a violating element at the original position instead. Due to the discrepancy in the service's implementation, the cryptographic verification succeeds, but the business logic interprets the violating element. Nonetheless, the originally signed element must remain in the document and causes anomalous structure.

Somorovsky et al. [394] have demonstrated a signature wrapping attack on SAML. In SAML, an identity provider signs a client's authentication and authorization claims, and a successful attacker is able to impersonate other users or issue arbitrary claims. The majority of publicly available SAML implementations were vulnerable because of the extension points in the SAML schema.

Jensen et al. [204] show that schema validation can be an effective defense against signature wrapping attacks but requires a hardened schema without extension points. All schemas in a service must be known beforehand, so they can be unified into a single schema that represents the acceptable documents. The unification process is however

computationally hard because of combinatorial effects from embedding schemas at extension points, and restrictions are needed. Also, the large size of a hardened schema imposes a penalty on validation performance in the experiments [204]. The proposed learning-based approach circumvents the problems encountered in schema hardening by inferring a language representation of acceptable documents directly from examples.
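The verification/processing discrepancy behind Figure 1.4 can be made concrete with a small, self-contained Java sketch. The message below is a hypothetical reconstruction modeled after the figure (the ds:Signature element itself is omitted, and 'urn:example:wsu' stands in for the WS-Security utility namespace); it is not code or data from the thesis. Signature verification resolves the Reference URI "#123" via the Id attribute and therefore still finds the untouched, wrapped soap:Body, while a business logic that dispatches on the top-level soap:Body sees the attacker's operation. A syntactic model learned from unwrapped messages would, in contrast, already reject the additional soap:Body under the Wrapper element.

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class WrappingDiscrepancyDemo {
        static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";

        // Wrapped message modeled after Figure 1.4b (signature element omitted for brevity).
        static final String WRAPPED =
            "<soap:Envelope xmlns:soap='" + SOAP_NS + "'>"
          + "<soap:Header><Wrapper>"
          + "<soap:Body xmlns:wsu='urn:example:wsu' wsu:Id='123'><MonitorInstances/></soap:Body>"
          + "</Wrapper></soap:Header>"
          + "<soap:Body><CreateKeyPair/></soap:Body>"
          + "</soap:Envelope>";

        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(WRAPPED)));

            // Signature verification: resolve the Reference URI "#123" by the Id attribute,
            // which still points to the original (now wrapped) soap:Body.
            Element verified = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[@*[local-name()='Id']='123']", doc, XPathConstants.NODE);

            // Business logic: dispatch on the top-level soap:Body, i.e., the last Body in
            // document order, which now carries the attacker's operation.
            NodeList bodies = doc.getElementsByTagNameNS(SOAP_NS, "Body");
            Element processed = (Element) bodies.item(bodies.getLength() - 1);

            System.out.println("verified:  " + verified.getFirstChild().getNodeName());  // MonitorInstances
            System.out.println("processed: " + processed.getFirstChild().getNodeName()); // CreateKeyPair
        }
    }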

1.2 Objectives

Objective 1. The first objective is to review the state-of-the-art in today’s technological landscape for client-cloud interaction and to understand the interplay between languages, protocols, architectures, and service interaction patterns in cloud service integration. This knowledge is the foundation for understanding today’s attack surfaces, and a result of this objective is the importance of XML and XML-based protocols in cloud computing.

Objective 2. Monitoring of a computing system is an approach to gain insight about correctness in execution and security. Language-based anomaly detection must be capable of monitoring the exchanged messages in an interaction. To understand the technical feasibility, the second objective is therefore to review the state-of- the-art in client-cloud interaction monitoring and, most importantly, to identify the available observation points and their limitations.

Objective 3. The third objective has three sub-objectives. First, the state-of-the-art in anomaly-based intrusion detection needs to be studied to understand the existing methods and their limitations. Second, the proposed approach focuses on XML as a result of Objective 1, and XML attacks therefore need to be reviewed. Third, it should be clear what countermeasures are currently available for protecting against XML attacks. A particular result of this objective is that stream validation without extension points can mitigate several XML attacks, and XVPAs are a suitable foundation for representing the acceptable inputs in terms of document event streams.

Objective 4. Specifying the proposed security monitor and its components, including the necessary language representations and algorithms, is the fourth objective. First, XVPAs need to be extended for text contents in mixed-content XML because text contents are not modeled in XVPAs. Second, text contents can be from an arbitrary language class and affect the learning process. A generalization of text contents is therefore necessary. Third, algorithms for learning from positive examples have to be specified. Fourth, the learner has to operate in a potentially adversarial environment, where an attacker might be able to poison the training data. Additional mechanisms for dealing with poisoning attacks are therefore necessary. The results of this objective are an incremental and set-driven learner and operations for unlearning and sanitization in case of poisoning attacks.

Objective 5. A learned automaton from XML documents is a heuristic, and the fifth objective is to deliver a proof of concept. First, the algorithms for the proposed security monitor need to be implemented. Second, appropriate experiments need


to be defined. This includes test cases where state-of-the-art XML attacks are simulated and the effectiveness of the monitor can be measured.

1.3 Outline

This thesis is structured as follows. Chapter 2 investigates the background by summarizing the state-of-the-art technologies for client-cloud interaction, the motivation for client-cloud interaction monitoring, and available observation points for implementing a monitoring solution (Objectives 1 and 2).

The literature review in Chapter 3 surveys intrusion detection and, in particular, anomaly-based intrusion detection as security monitoring strategies. Kernel methods in machine learning are considered the state of the art for anomaly detection in tree-structured data and are therefore explained in greater detail. Furthermore, XML and XML attacks are discussed, and related work on stream validation and schema inference is presented (Objective 3).

The approach toward language-based anomaly detection in XML is introduced in Chapter 4 by proposing a security monitor architecture and its components, defining an attacker model, stating the domain-specific problems, introducing the methodology with respect to grammatical inference, and discussing use cases for a learning-based security monitor (Objective 4).

Chapter 5 is the main contribution, in particular, the definition of dXVPAs and cXVPAs as extensions of XVPAs for validating mixed-content XML. Text content is generalized by minimally required datatypes, and an incremental and set-driven learner for inferring a reference model of acceptable inputs in terms of a dXVPA is specified. For dealing with poisoning attacks during the learning process, operations for unlearning and sanitization are proposed (Objective 4).

The learner and validator components of the proposed security monitor are experimentally evaluated in Chapter 6 for a proof of concept. Remarks on the implementation are given, and measures for effectiveness are defined. The implementation has been evaluated in two synthetic and two simulated scenarios, where state-of-the-art XML attacks were executed, and results are presented (Objective 5).

Chapter 7 concludes this thesis by elaborating on the experimental evaluation results. The treatment of the objectives is summarized, and for completeness, all assumptions, restrictions, and design choices in the proposed approach are recalled. The chapter closes by stating open research questions for future research opportunities.

The reader should be familiar with the fundamentals of computer security [58] and basic concepts in formal language theory [216], in particular, regular languages, regular expressions, and deterministic finite automata (DFAs).

1.4 Contributions

The main contributions in this thesis are:

1. the dXVPA and cXVPA language representations for mixed-content XML and efficient stream validation

2. a lexical datatype system for inferring a set of minimally required datatypes from text content based on lexical subsumption and a preference heuristic


3. an incremental and set-driven learner with unlearning and sanitization capabilities for inferring a dXVPA from example documents

The learner and validator components of the proposed security monitor have been implemented and experimentally evaluated. The proof-of-concept evaluation uses four generated datasets representing two synthetic and two simulated scenarios. While the synthetic scenarios were generated using the ToXGene tool [35], the simulated datasets were collected from a SOAP/WS-* web service called VulnShopService, which was also implemented in the course of this thesis. State-of-the-art attacks were generated manually and automatically by the WS-Attacker tool [253]. The proposed approach achieved promising results on all four datasets compared to the baseline performance of traditional schema validation. Only a few examples were needed for the learner to converge to a stable dXVPA representation, and attack detection rates were between 82.35% and 100% without false positives in all experiments.

Publications. All the main results and contributions presented in this thesis have been published or are accepted at the time of writing:

• Lampesberger, H.: An incremental learner for language-based anomaly detection in XML. In: 2016 IEEE Symposium on Security and Privacy Workshops, LangSec'16. IEEE (2016). (To appear)

• Lampesberger, H., Rady, M.: Monitoring of client-cloud interaction. In: Correct Software in Web Applications and Web Services, Texts & Monographs in Symbolic Computation, pp. 177–228. Springer International Publishing (2015)

• Lampesberger, H.: Technologies for web and cloud service interaction: A survey. Service Oriented Computing and Applications (2015). DOI: 10.1007/s11761-015-0174-1

• Lampesberger, H.: A grammatical inference approach to language-based anomaly detection in XML. In: 2013 International Conference on Availability, Reliability and Security, ECTCM’13 Workshop, pp. 685–693. IEEE (2013)

Chapter 2

Background: Interaction and Monitoring

Cloud clients and services operate in different jurisdictions, and both parties need to communicate for interaction. Monitoring of observable input-output behavior is a well-known approach in computer security, e.g., network firewalls, and the LangSec threat model regards the language for communication as a major origin of vulnerabilities. However, today's cloud computing is more of a paradigm than a defined set of technologies, and a wide range of languages and compositions thereof are available for integration. This chapter provides a detailed review of technologies that enable client-cloud interaction and investigates how interaction can be monitored.

Sections 2.1–2.5 summarize standardized languages, conceptual service interaction patterns, communication protocols, and interaction architectures for clouds, respectively. The presented study indicates that XML is a foundation for data serialization and communication protocols, and monitoring of XML is therefore a viable approach under the LangSec threat model. Section 2.6 then analyzes the goals and limitations of client-cloud interaction monitoring, and technically feasible observation points are reviewed. Section 2.7 closes the chapter by drawing conclusions from technical trends; in particular, pervasive encryption and multihoming render traditional network security tools, e.g., IP firewalls, incapable of observing interaction. A future-proof monitor therefore needs to operate as a client or service component or in a middleware.

2.1 Client-Cloud Interaction

Figure 2.1 highlights the identified relationships between languages, service interaction patterns, protocols, architectures, and implementations. A language is essential for encoding information in a transportable format and defines an alphabet, syntax, and semantics. Structured data or multimedia content is represented as a message or sequence of messages in a particular language, and standards for cloud computing are covered in Section 2.2.

Figure 2.1: Concepts and relationships identified in communication technologies [226]

In the literature, interaction of services is primarily discussed in terms of patterns [1, 38, 37, 184, 496]; however, today's implementations in clients and clouds often resort to ad hoc specifications, and standardization efforts are usually community-driven. Section 2.3 recalls the conceptual service interaction patterns by Barros et al. [38, 37] as a taxonomy for further discussions on protocols and architectures in later sections.

For a message transport mechanism, a communication protocol defines the rules of engagement, i.e., low-level interaction patterns, for a language. A protocol can also integrate other protocols into a so-called protocol stack to achieve sophisticated transport mechanisms, e.g., order and delivery guarantees. Section 2.4 reviews communication protocols that are currently in use for cloud computing.

An architecture for client-cloud interaction then specifies supported transport mechanisms, languages for message formats, and high-level interaction patterns for message exchange. An architecture is in a sense a blueprint for service delivery and implemented in software by consumers and providers. Section 2.5 investigates popular architectures for cloud integration.

2.1.1 Historical Context

From a historical point of view, the evolution of the web and of enterprise services converges in today's technologies for client-cloud interaction. Enterprise architectures to access and share network-based resources and use remote functionality in a program go back to the 1970s [484]. In the 1980s, the remote procedure call (RPC) framework by Birrell and Nelson [57] introduced location transparency for network-accessible procedures, where a shared type system couples programs tightly. Object orientation and RPC led to distributed objects [310] in the 1990s, which still need tight coupling, and scalability suffers from communication complexity. Asynchronous messaging between loosely coupled services became popular at the end of the 1990s as a more scalable enterprise architecture, which ultimately led to today's message-oriented middleware and service-oriented architecture (SOA) [303].

In the meantime, the foundation for a World Wide Web of interlinked nonlinear text, i.e., hypertext, was established by Berners-Lee [468] in 1989, and in 1997, the first Hypertext Markup Language (HTML) recommendation was announced. The web has evolved since then from simple hypertext exchange toward rich client applications, user-provided content, web mashups, and social platforms [120]. In the last decade, web applications have begun to integrate enterprise service paradigms, and, on the other hand, the Hypertext Transfer Protocol (HTTP) and web paradigms have effectively been repurposed for creating enterprise services that are Internet and web compatible [72, 336].

Today, cloud computing integrates technologies from both enterprise services and the web to deliver a service across heterogeneous devices and platforms using widely accepted protocols and formats [32]. However, the definition of the term “service” varies in the literature [380]. In this thesis, a service is therefore considered a distributed, network-accessible software component that offers functionality by communicating messages at a specified service interface. The involved peers or parties in client-to-service interaction are a consumer (client) that accesses a service offered by a service provider. For service-to-service interaction, a service adopts the role of a client to access another service. In

client-to-client interaction, a third-party service typically coordinates two clients, so they can engage directly. However, some control needs to be handed to the service in this case, which leads to integration difficulties.

2.2 Languages for Content and Media

An alphabet is a finite set of symbols, and a language is usually an infinite set of strings over an alphabet [185]. Languages for encoding content or media in information exchange are typically referred to as data serialization formats, and communicating peers can parse a message if the language's syntax is well defined. The expressiveness of a language affects the computational complexity of parsing, and with increasing complexity, software implementations become more vulnerable [376]. Digital systems rely on a binary alphabet, and bytes are a common unit of data in computing and Internet applications. Based on the interpretation of bits and bytes, content can be distinguished into text-based and binary. While binary content is a machine-readable representation of information, text-based content is human readable by applying a character encoding to map bits and bytes into human-readable symbols. Popular character encodings are the seven-bit ASCII (American Standard Code for Information Interchange), which is restricted to English text, and ASCII-compatible UTF-8 [494], which encodes human-readable symbols in Unicode [421]. A type is a general concept shared by a set of objects [226]. A content type has a unique identifier and specifies the alphabet, possibly a character encoding, the syntax, and semantics for a language. Content types enable modular software, where an appropriate parsing component for some content is chosen during runtime, e.g., a web browser selects a proper module for data returned by a web server based on the data's content type. Today, the Multipurpose Internet Mail Extensions (MIME) [137] standardize the notion of a MIME content type in Internet applications, i.e., Internet media type, and an example for UTF-8 encoded text is the MIME type text/plain; charset=utf-8.
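To make the runtime selection of a parsing component concrete, the following minimal Java sketch splits a MIME type string into its base type and charset parameter and looks up a handler; the parser names in the registry are hypothetical placeholders, not part of any standard API.

```java
import java.util.Locale;
import java.util.Map;

// Minimal sketch: dispatching on a MIME content type such as "text/plain; charset=utf-8".
public class ContentTypeDispatch {

    // Hypothetical parser registry; a real system maps MIME types to parser modules.
    static final Map<String, String> PARSERS = Map.of(
            "text/plain", "PlainTextParser",
            "text/html", "HtmlParser",
            "application/xml", "XmlParser");

    public static void main(String[] args) {
        String contentType = "text/plain; charset=utf-8";
        String[] parts = contentType.split(";");
        String baseType = parts[0].trim().toLowerCase(Locale.ROOT);   // "text/plain"
        String charset = "US-ASCII";                                   // default if no parameter is given
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].trim().split("=", 2);
            if (kv.length == 2 && kv[0].equalsIgnoreCase("charset")) {
                charset = kv[1].trim();
            }
        }
        System.out.println("base type: " + baseType + ", charset: " + charset
                + ", parser: " + PARSERS.getOrDefault(baseType, "unknown"));
    }
}
```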

2.2.1 Data Serialization Formats

The most notable examples for text-based semi-structured languages are HTML, XML, and JavaScript Object Notation (JSON). An HTML document encodes a website, has MIME type text/html, and up to version 4.01 [431], it is an application of the Standard Generalized Markup Language (SGML), which requires a complex parser. Versions are distinguished in a document's preamble, and today's HTML5 [471] is an unambiguous syntactic specification, independent from SGML. A web browser parses an HTML document into a Document Object Model (DOM) [438]. XML [451] uses well-matched tags to serialize structured information and is considered the lingua franca of electronic data exchange. A schema, expressed in some schema language, is like a grammar and characterizes a set of documents by restricting document structure and possibly text contents. Examples for schema languages are DTD [451], XSD [443], and Relax NG [292]. Extended DTD (EDTD) [328] is a conceptual schema language primarily used in theoretical work. As argued in this chapter, XML is a core technology for client-cloud interaction and also the center of this thesis. The language properties are therefore discussed in greater detail in Section 3.2. Unrestricted XML has MIME type application/xml, and well-known subtypes are specified by schemas, e.g.,

the re-specification of HTML as XHTML (application/xhtml+xml) and Scalable Vector Graphics (image/svg+xml). JSON [71] is a text-based key-value data interchange format, popular for web applications, and the MIME type is application/json. The syntax of JSON is a valid JavaScript program that evaluates to an object in a JavaScript runtime environment. Other text-based data serialization formats are the markup languages Candle [79] and YAML [42], the Open Data Description Language [236], the Ordered Graph Data Language [424] for serializing data graphs, and comma-separated values for relational data. Base16, Base32, and Base64 [205], percent-encoding [45], and quoted-printable encoding [136] are so-called binary-to-text encoding schemes to translate a binary value, i.e., a bit string, into a string of printable ASCII characters. Base64 is especially popular for embedding binary content in text-based data serialization formats such as XML or JSON. Binary formats are intended for machine readability and are typically not human readable, e.g., video and audio formats. A well-known standard for specifying information structure and encoding rules is Abstract Syntax Notation One (ASN.1) [115], e.g., for encoding of X.509 certificates in a public key infrastructure [352]. Furthermore, several binary encoding schemes for text-based JSON have been proposed for optimized processing, e.g., Binary JSON [77], Concise Binary Object Representation [65], and MessagePack [141]. Binary formats are often found in RPC architectures for call and argument serialization, e.g., Apache Avro [25], Etch [23], and Thrift [22]; External Data Representation (XDR) [268]; Google Protocol Buffers [157]; and Hessian [81].
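Because schema validation against an XSD recurs throughout this thesis, a minimal sketch using the javax.xml.validation API shipped with Java SE is given below; the schema and document file names are placeholders.

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Minimal sketch: syntactic validation of an XML document against an XSD schema.
public class SchemaValidation {
    public static void main(String[] args) throws Exception {
        // Placeholder file names; replace with an actual schema and instance document.
        File schemaFile = new File("message.xsd");
        File documentFile = new File("message.xml");

        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(schemaFile));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(documentFile));
            System.out.println("document is schema valid");
        } catch (SAXException e) {
            System.out.println("validation failed: " + e.getMessage());
        }
    }
}
```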

2.2.2 Container Formats

Conceptually, a container integrates arbitrary contents of varying content types into a single object or message. A popular container format is the MIME multipart content type [137]. A multipart message has a text-based header that defines a boundary string to separate the message into parts. Every part has an individual header that specifies its local content type and additional metadata. Examples of multipart content types are multipart/mixed for emails with attachments and the S/MIME [350] types multipart/encrypted and multipart/signed for encrypting or signing content. MIME multipart has a limitation: the format of a container is only consistent if the boundary string does not appear in any part of the container. Another container format, similar to MIME multipart but with improved boundaries, is Microsoft's Direct Internet Message Encapsulation (DIME) [263].
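The boundary mechanism can be illustrated with a small sketch that manually assembles a multipart/mixed body with a text part and a Base64-encoded binary part; the boundary string is a placeholder, and a production system would use a MIME library instead of string concatenation. The consistency limitation noted above corresponds to the requirement that the boundary must not occur inside any part.

```java
// Minimal sketch: manually composing a MIME multipart/mixed body.
public class MultipartSketch {
    public static void main(String[] args) {
        String boundary = "----=_Part_0_12345";   // placeholder boundary string
        StringBuilder body = new StringBuilder();
        body.append("Content-Type: multipart/mixed; boundary=\"").append(boundary).append("\"\r\n\r\n");

        // First part: human-readable text with its own local header.
        body.append("--").append(boundary).append("\r\n");
        body.append("Content-Type: text/plain; charset=utf-8\r\n\r\n");
        body.append("Hello, this is the text part.\r\n");

        // Second part: binary content embedded via Base64 binary-to-text encoding.
        body.append("--").append(boundary).append("\r\n");
        body.append("Content-Type: application/octet-stream\r\n");
        body.append("Content-Transfer-Encoding: base64\r\n\r\n");
        body.append(java.util.Base64.getEncoder().encodeToString(new byte[] {1, 2, 3})).append("\r\n");

        // Closing boundary marks the end of the container.
        body.append("--").append(boundary).append("--\r\n");
        System.out.print(body);
    }
}
```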

2.3 Service Interaction Patterns

A content type specifies a message format, so clients and services can understand communicated messages. This section recalls service interaction patterns, as introduced by Barros et al. [37, 38], to characterize how messages are conceptually exchanged between communicating parties. Figure 2.2 summarizes the service interaction patterns that have been defined based on three dimensions: the number of participants, the number of message exchanges, and whether third parties are involved. Consequently, there are four groups: single-transmission bilateral patterns, single-transmission multilateral patterns, multi-transmission patterns, and routing patterns. In this section, A, B, C, etc. are considered to be communicating parties.


Figure 2.2: Service interaction patterns for messaging between parties [37, 38]

2.3.1 Single-Transmission Bilateral Patterns

Send. A sends a message to B. Related to: unicast, point-to-point send.

Receive. A waits for a message. Related to: listener, event handler.

Send-receive. A sends a message to B, and B sends a response message to A. The dual of this pattern is receive-send, where A responds to incoming messages. Related to: RPC, request-reply, request-response.

2.3.2 Single-Transmission Multilateral Patterns

Racing incoming messages. A waits for the first message from a number of potential senders. After receiving the message, A reacts according to the message type or sender, and later arriving messages are possibly ignored. Related to: deferred choice.

One-to-many send. A sends n messages of the same type to B1 ...Bn; however, message contents are not necessarily the same. Related to: multicast, broadcast, publish-subscribe, fan-out, event notification, scatter.

One-from-many receive. A awaits messages from B1 ...Bn for a certain time span, so messages are logically linked together. Related to: gather, fan-in, event aggregation.

One-to-many send-receive. A sends n request messages to B1 ...Bn and awaits n responses for a certain time span. The dual of this pattern is one-from-many receive-send, where A awaits n requests from B1 ...Bn for a certain time span and returns n responses after processing. Related to: scatter-gather.

2.3.3 Multi-Transmission Patterns

Multi-responses. A sends a request to B, and B returns responses until some stop condition is met, e.g., an explicit stop request from A, temporal conditions, inactivity, or an explicit end notification by B. Related to: streamed responses, message stream.


Contingent requests. A sends a request to B1, and if there is no response in time, the request is resent to B2, and so on. A continues until some Bi returns a response. Related to: send with failover.

Atomic multicast notification. A sends notifications to B1 ...Bn. At least i and at most j recipients are required to accept. When i = j = n, all recipients are expected to accept. Related to: transactional notification.

2.3.4 Routing Patterns

Request with referral. A sends a request to B, and based on certain conditions, e.g., message content, responses are forwarded to C or C1 ...Cn. Related to: reply to.

Relayed request. A sends a request to B, and B forwards it to C1 ...Cn, who then respond to A. B observes the interaction between A and C1 ...Cn. Related to: delegation.

Dynamic routing. Routing conditions define how a message is forwarded by a party, and conditions can be content-dependent and change over time. A message from A is possibly forwarded to B1 ...Bn based on the conditions. B1 ...Bn possibly forward the message to C1 ...Cm, and so on. Related to: routing slip, content-based routing.

2.4 Protocols for Computer Networks

Layered communication protocols for computer networks are an industry standard for delegating communication complexity, and reference models are the OSI model [499] and the simplified Internet model [67]. A transport mechanism for messages is realized as a protocol stack, so required properties like reliable transport or acknowledgments are satisfied. The state-of-the-art communication protocols used for transport mechanisms in today's client-cloud interaction are summarized in Figure 2.3. While link layer protocols specify how connected devices can physically transmit information, e.g., Ethernet or Wi-Fi networking, the Internet layer considers communication beyond physical boundaries by logical addressing of hosts and packet-based routing.

Figure 2.3: The Internet model [67] and client-cloud communication protocols


Today’s Internet Protocol (IP) [343] and its successor IP Version 6 (IPv6) [110] constitute the backbone of the Internet. Both protocols specify dynamic routing of packets for bilateral packet exchange between hosts identified by logical IP addresses. Every packet has a header that stores the sender and destination address, similar to an envelope, and a byte-oriented payload. Packets are forwarded to their destination or discarded when the destination is unreachable—there are no delivery guarantees. Also, if a packet is too large for a single link layer frame, the packet is either fragmented or discarded. For one-to-many send interaction, multiple destinations can be addressed by broadcast and multicast in IP or multicast in IPv6. However, multicast is typically disabled in clouds because of an increased load on the provider’s network infrastructure [272].

2.4.1 Transport Layer Protocols

Transport layer protocols enable payload-neutral inter-process communication. Characteristics to distinguish protocols are: uni- or bidirectional communication, stateful or stateless connection, message- or stream-oriented exchange, ordering guarantees, reliable delivery, data integrity, and congestion control. The Transmission Control Protocol (TCP) [344] and the User Datagram Protocol (UDP) [342] are the most prominent transport protocols today and are available in all modern operating systems. To enhance the mobile and web experience when consuming a cloud service, other protocols such as MultiPath TCP (MPTCP) [132], the Stream Control Transmission Protocol (SCTP) [401], and Google's Quick UDP Internet Connections (QUIC) [363] have been proposed.

Transmission Control Protocol. TCP establishes a bidirectional, stateful connection between two endpoints, i.e., processes, and information is bidirectionally exchanged in byte streams [344]. For compatibility with packet-based transmission, TCP transfers byte-stream segments with a distinguished header. To identify the endpoints, the header refers to a source and destination port number. The protocol synchronizes the connection in a three-way handshake during establishment and guarantees integrity and order for both streams using acknowledgments, retransmissions, and checksums. Furthermore, TCP supports flow control to adapt segment sizes in case of path congestion. TCP is a client-server protocol, i.e., a server waits for an incoming connection, and the connection can be terminated by both endpoints. In terms of interaction patterns, an established TCP connection supports bilateral send and receive interaction. To minimize handshake delay, the TCP Fast Open [86] extension introduces faster synchronization between familiar endpoints.
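A minimal sketch of bilateral send and receive over TCP is shown below using Java's blocking socket API; the host and the HTTP HEAD request are placeholders and serve only as an application payload that a public web server will answer.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Minimal sketch: bilateral send and receive over a TCP byte stream.
// The three-way handshake happens inside the Socket constructor.
public class TcpSendReceive {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.org", 80)) {
            OutputStream out = socket.getOutputStream();
            out.write("HEAD / HTTP/1.1\r\nHost: example.org\r\nConnection: close\r\n\r\n"
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // print the bytes received over the ordered stream
            }
        }
    }
}
```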

Multipath TCP. MPTCP is another TCP extension to introduce multihoming, i.e., the simultaneous presence of a host in multiple physical networks [132]. An MPTCP connection behaves similar to a TCP connection but actually aggregates multiple TCP connections over all available links, so the connection stays available even if one of the links is disrupted. In terms of interaction patterns, MPTCP is identical to TCP.

User Datagram Protocol. UDP is a minimalistic transport protocol with little overhead and stateless interaction [342]. An endpoint sends a byte-oriented message, i.e., a datagram, to another endpoint without delivery guarantees or ordering when multiple messages are sent. Similar to TCP, endpoints are referenced by a source and destination port, and a datagram checksum guarantees integrity. However, UDP does not specify retransmission in case of an incorrect checksum. In terms of patterns, UDP supports

bilateral send and receive and one-to-many send by resorting to IP broadcast or multicast. There exist several UDP-inspired protocols that differ in their features, but no applications in web or cloud environments were found. Examples are the unreliable Datagram Congestion Control Protocol (DCCP) [214] that provides congestion control, relaxed checksum calculation in UDP Lite [232], and Reliable UDP (RUDP) [66], which supports retransmissions, acknowledgments, and flow control.
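The following sketch illustrates stateless UDP interaction with Java's datagram API; the destination address, port, and the assumption of a local echo service are placeholders, and the timeout reflects that UDP gives no delivery guarantee.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Minimal sketch: stateless send and receive of a single UDP datagram.
public class UdpSendReceive {
    public static void main(String[] args) throws Exception {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        InetAddress destination = InetAddress.getByName("127.0.0.1");   // placeholder echo service
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length, destination, 9999));

            byte[] buffer = new byte[512];
            DatagramPacket response = new DatagramPacket(buffer, buffer.length);
            socket.setSoTimeout(2000);   // no delivery guarantee, so do not block forever
            socket.receive(response);    // blocks until a datagram arrives or the timeout expires
            System.out.println(new String(buffer, 0, response.getLength(), StandardCharsets.UTF_8));
        }
    }
}
```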

Stream Control Transmission Protocol. SCTP has been designed as an alternative to TCP [401]. SCTP establishes a stateful association between two endpoints in a four-way handshake to eliminate the security threat of TCP SYN flooding [118]. As in UDP and TCP, endpoints are identified by a source and destination port number. The protocol supports multihoming and multiplexing, i.e., the simultaneous transmission of byte-oriented messages within an association. SCTP provides checksums for reliability, ordering and delivery guarantees for messages are optional, and congestion control is supported. In terms of interaction patterns, an established SCTP association supports bilateral send and receive interaction. However, transportation of SCTP over the Internet is not guaranteed because of its limited support in networking devices, also referred to as middleboxes.

Quick UDP Internet Connections. The experimental QUIC protocol by Google is an attempt to reduce latency and redundant transmissions in web environments [363]. QUIC operates on top of UDP for Internet compatibility, establishes a stateful connection for bilateral send and receive between two endpoints, and integrates the multiplexing concept from SCTP for simultaneous transmissions within a connection. The protocol also minimizes synchronization delays between familiar hosts, similar to TCP Fast Open. QUIC features transparent data compression, checksums, forward error correction using an error-correcting code, retransmissions, and congestion control.

Transport Security

Transport layer protocols transmit byte-oriented messages or streams and are therefore payload neutral. There are several standards that extend this payload neutrality to provide application-independent secure sessions for confidentiality, integrity, and authenticity. The two most prominent standards that operate on top of TCP are the Secure Sockets Layer (SSL) [138] and its successor Transport Layer Security (TLS) [113]. Both protocols provide bidirectional byte-oriented streams on top of TCP. An SSL or TLS handshake to establish a secured session typically follows after the TCP handshake. TLS explicitly supports an application-triggered handshake to upgrade an already established TCP connection at a later time [352]. SSL and TLS require the ordering and delivery guarantees of TCP, or the handshake will fail. In case of MPTCP, TLS sessions can be established for the individual TCP connections in the aggregated MPTCP connection [379]. However, for multiplexed protocols like SCTP and QUIC, TLS is not a good option. When simultaneous interactions are multiplexed over a single TLS session, a single incorrect byte would stall all parallel interactions. Furthermore, SCTP supports partial reliability, which conflicts with the reliability requirements of TLS. Datagram TLS (DTLS) [353] is an alternative secure session protocol on top of UDP, and it has been specified for other transport protocols, e.g., DTLS over DCCP [339] and DTLS for


SCTP [419]. QUIC's own cryptographic protocol is motivated by TLS and active by default, so an interaction cannot be modified by middleboxes [231].
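A minimal sketch of establishing a TLS session on top of TCP with Java's default trust store is given below; the host name is a placeholder, and error handling is omitted.

```java
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Minimal sketch: a TLS handshake on top of an established TCP connection.
public class TlsHandshake {
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket("example.org", 443)) {
            socket.startHandshake();   // server certificate is verified against the default trust store
            System.out.println("protocol: " + socket.getSession().getProtocol());
            System.out.println("cipher suite: " + socket.getSession().getCipherSuite());
        }
    }
}
```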

2.4.2 Application Layer Protocols

Application layer protocols specify inter-process communication on top of transport protocols. Informally, these protocols can be distinguished by: transport layer requirements, the use of a text-based or binary protocol syntax, stateful or stateless interaction, and whether data and control are transferred in the same or in separate channels. For example, the File Transfer Protocol (FTP) [345] and the Simple Mail Transfer Protocol (SMTP) [211] are two popular protocols that are stateful and support TLS. While FTP separates binary data and text-based control into two TCP connections, SMTP interleaves the text-based commands and the email message in a single TCP connection. As transport protocols are payload neutral, an out-of-band agreement on the application protocol is necessary. One approach is to assign default application protocols to listening server ports. The Internet Assigned Numbers Authority (IANA) [192] therefore maintains a public list of port assignments, e.g., port 25/TCP is the default listening port for SMTP.

Domain Name System

One of the core Internet protocols is the Domain Name System (DNS) [275], a distributed database for assigning human-readable names to numerical IP addresses. A fully qualified domain name (FQDN) is the hierarchical name of a host, and this abstraction is used by the majority of Internet clients and services to locate endpoints and even for dynamic service discovery [87]. The DNS protocol for querying the distributed database is a stateless send-receive binary protocol. DNS operates on top of UDP for minimal latency but also supports TCP for large queries or DNS transactions. To assure correctness and authenticity of queries and responses, DNSSEC [30] introduces cryptographic methods. While DNS locates hosts in a network, a uniform resource identifier (URI) [45] identifies some resource as a human-readable string. A uniform resource locator (URL) is a special type of URI that also locates a resource, and an example is illustrated in Figure 2.4. The application protocol for accessing the resource is defined by its URI scheme, and the transport protocols and ports for schemes are maintained in an IANA database. The authority part locates the host of a resource either by IP address or FQDN, and the path identifies the actual resource. Alternative port information or user credentials can also be embedded in the authority part. Authority and path represent the location of a resource, and a URL can additionally contain an optional query part and fragment to address a location within the resource. The so-called origin of a URL is the triple of its scheme, FQDN or IP address, and port [40].

Figure 2.4: A URL identifies and locates a resource, e.g., http://www.ex.com/dir/fotos.php?action=view#page2 with scheme (http), authority (www.ex.com), path (/dir/fotos.php), query (action=view), and fragment (page2)


The Hypertext Transfer Protocol

HTTP is another core Internet protocol. It is text-based, stateless, and originally intended for bilateral send-receive interaction over TCP. Both data and control are interleaved in the same connection, and HTTP specifies message formats for requests and responses using delimiters, a header, and an optional body. Today's version HTTP/1.1 [127] supports eight methods: OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, and CONNECT. A request states the method, a URI path of a resource, the protocol version, additional headers, and possibly a body. A service returns a response, where the header states a status code, and the requested resource is attached as body if available. The MIME content type of a body is declared in a dedicated header field. HTTP/1.1 adds various features to the original specification, e.g., the Upgrade header to request an application protocol switch after a send-receive cycle, persistent connections and request pipelining for minimizing delays, Content-Encoding and Transfer-Encoding for body compression, and ETag and conditional GET for caching mechanisms. Furthermore, HTTP/1.1 adds support for client- or service-driven content negotiation [128]. A client can announce its User-Agent, accepted content types, character and content encodings, and accepted languages in a request header, so the service can return a suitable representation. On the other hand, a service can return a multiple-choices status when multiple resource representations are available. Also, HTTP/1.1 adds cookie support for stateful HTTP sessions, i.e., a string identifier in request headers for application state tracking [39]. HTTP Secure (HTTPS) [351] is the default for SSL/TLS protected HTTP interaction and has its own URI scheme (https:). To prevent clients from accessing resources over insecure connections, HTTP Strict Transport Security [181] specifies a policy exchange to notify clients about HTTPS-only access.
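A minimal HTTP/1.1 send-receive cycle with client-driven content negotiation can be sketched as follows; the URL and header values are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Minimal sketch: an HTTP GET request announcing accepted content types and charset.
public class HttpRequest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.org/");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "text/html, application/xhtml+xml");
        connection.setRequestProperty("Accept-Charset", "utf-8");
        connection.setRequestProperty("User-Agent", "thesis-example/0.1");

        System.out.println("status: " + connection.getResponseCode());
        System.out.println("content type: " + connection.getContentType());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            System.out.println("first line of body: " + in.readLine());
        }
    }
}
```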

Push Technology. Send-receive interaction in HTTP can only be initiated by the client. Push technology [7] refers to techniques that enable asynchronous send and receive interaction without breaking the HTTP specification, e.g., a service can push an event to the client. First attempts have been client-side long polling, where the request hangs and the service delays its response until an event or timeout occurs. By exploiting an HTTP/1.1 persistent connection, Comet [366] (HTTP Streaming, HTTP server push) immediately responds with a multipart/x-mixed-replace container and gradually returns container parts as events. In terms of patterns, Comet is a multi-responses pattern. An approach using the Upgrade header, called Reverse HTTP [147], switches the roles of client and service after an initial send-receive cycle; the original service can then initiate interaction by sending HTTP requests, i.e., client-side receive-send interaction. Bidirectional streams over synchronous HTTP (BOSH) [330] establishes two TCP connections between client and service; the client uses the first connection for outgoing requests, and a hanging request in the second connection waits for asynchronous service responses. The state of the art in push technology consists of server-sent events (SSE) [461] and WebSocket [462]. SSE is similar to Comet; a response of MIME type text/event-stream gradually delivers events as byte chunks, i.e., a multi-responses pattern, and the client reconnects automatically in case of a timeout. WebSocket uses the Upgrade header to establish a bidirectional byte-oriented connection that has properties similar to TCP. WebSocket supports TLS, it provides two URI schemes for unencrypted and encrypted connections (ws:, wss:), and in terms of patterns, an established WebSocket connection

supports bilateral send and receive. However, an explicit application protocol within a WebSocket session needs to be negotiated during the HTTP Upgrade phase using designated HTTP headers.
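The multi-responses nature of SSE can be sketched with a plain HTTP connection that keeps reading text/event-stream lines; the event stream URL is a placeholder, and a real client would also handle reconnects and the id and event fields.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Minimal sketch: consuming server-sent events as a multi-responses pattern.
public class SseClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.org/events");   // placeholder event stream
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("Accept", "text/event-stream");
        connection.setReadTimeout(0);   // keep the connection open; events arrive gradually

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("data:")) {            // one event field per line
                    System.out.println("event payload: " + line.substring(5).trim());
                }
            }
        }
    }
}
```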

HTTP Performance. A website concurrently requests multiple resources, but the number of simultaneous TCP connections to an HTTP server is restricted, and persistent connections or pipelining are still sequential; the performance is therefore limited. Domain sharding [397] is a workaround for the connection limit by distributing resources across different authorities to increase the maximum number of parallel connections. HTTP-NG [429], structured stream transport [133], HTTP over SCTP [283], and SMUX [430] are experimental standards that propose multiplexed resource access in a single connection to overcome the parallel connection limit. However, none of the experimental standards has reached practical relevance. Google SPDY [409] is the state-of-the-art for improving HTTP performance. SPDY introduces multiplexed resource access, prioritization, service-initiated push of related resources, and transparent content and header compression. SPDY is an augmentation for HTTP; it changes the HTTP wire format, but the semantics stay the same. In terms of patterns, SPDY enables send-receive and multi-responses interaction. An encrypted TLS session is mandatory, and SPDY support is negotiated between client and service during the TLS handshake using TLS Next Protocol Negotiation (NPN). Furthermore, the transport protocol QUIC was specifically designed for SPDY to minimize delays. The upcoming HTTP/2 standard will integrate WebSocket and concepts from SPDY as two core technologies [327]. However, the negotiation part will be replaced with TLS Application Layer Protocol Negotiation (ALPN) in the final draft [41].

Web Client-to-Client Communication

The increasingly popular client-side web browser extension for web real-time communications (WebRTC) [467] enables direct client-to-client bidirectional send and receive interaction. However, for discovery, signaling between clients, and dealing with firewalls, a third-party service is still required. Motivated by video and audio streaming, WebRTC is DTLS based and also allows data exchange over an RTCDataChannel. This channel is in fact an SCTP-over-DTLS association and features optional ordering and message delivery guarantees. The clients still need to agree on an application protocol because an RTCDataChannel is payload neutral.

Messaging Protocols

As message-oriented middleware receives increasing interest in cloud services, messaging solutions also bring their proprietary and standardized application protocols into the cloud. For example, Microsoft Message Queuing (MSMQ) [265] has proprietary wire formats for interaction over UDP and TCP. MSMQ Version 3.0 also supports message exchange over HTTP or HTTPS, and IP multicast for addressing multiple recipients. Proprietary messaging solutions are provided by TIBCO, e.g., XML-based formats over TCP or WebSockets in the Enterprise Message Service [415] or proprietary formats over UDP and IP multicast in Rendezvous services [416]. Other implementation-specific wire formats found in messaging solutions are the OpenMQ binary wire format in Java Glassfish [154], the binary format OpenWire [20] for Apache ActiveMQ [19],

the TCP-based binary format in [27], and the binary frame format for the intelligent socket library ZeroMQ [190]. With respect to open messaging standards, the Advanced Message Queuing Protocol (AMQP) [302] is the most prominent open standard for messaging. The AMQP transport model specifies a multiplexed binary protocol over TCP or SCTP that supports message flow control, authentication, and TLS encryption. AMQP provides a self-contained datatype system to interconnect heterogeneous platforms, and peers exchange so-called frames within an established session. The Extensible Messaging and Presence Protocol (XMPP) [369] is another open standard for exchanging XML stanzas in client-to-service or service-to-service open-ended XML streams over TCP. For web compatibility, XMPP also supports interaction over HTTP and BOSH [329], and XMPP over WebSocket is an ongoing standardization effort [404]. Another messaging protocol, motivated by HTTP, is the Streaming Text Oriented Messaging Protocol (STOMP) [402]. STOMP has a text-based format and operates over bidirectional streams, e.g., TCP or WebSocket. With respect to lightweight messaging in the Internet of Things, three open messaging standards are MQ Telemetry Transport (MQTT) [189], the Constrained Application Protocol (CoAP) [389], and the Data Distribution Service for Real-Time Systems (DDS) [308]; all of them have binary message formats. MQTT has a small fixed-size header and operates over TCP with SSL/TLS support. CoAP is intended for HTTP compatibility, but its messaging is done asynchronously over UDP, and CoAP supports IP multicast for group interaction. The binary wire protocol Real-Time Publish-Subscribe (RTPS) [309] is used in DDS for messaging over TCP and UDP, and IP multicast for group interaction is also supported. Table 2.1 summarizes the interaction patterns in communication protocols.

Table 2.1: Interaction patterns in communication protocols [226]

Protocol: Interaction patterns and properties

IPv4, IPv6: Bilateral send and receive, multilateral one-to-many send (multicast, broadcast), dynamic routing
TCP: Bilateral send and receive, bidirectional byte streams with delivery and order guarantees
UDP: Bilateral send and receive, multilateral one-to-many send (IP multicast)
MPTCP: Bilateral send and receive, multihoming and multiplexing over individual TCP connections
SCTP: Bilateral send and receive, multihoming, multiplexing, byte-oriented messages, optional delivery and order guarantees
QUIC: Bilateral send and receive on top of UDP, multiplexing
HTTP, HTTPS: Client-to-service send-receive
Comet: Multi-responses from service to client, exploits HTTP persistent connections
Reverse HTTP: Client-side receive-send, switched roles after an HTTP send-receive cycle
BOSH: Bilateral send and receive, two TCP connections for HTTP, hanging request for service send
SSE: Multi-responses from service to client, similar to Comet
WebSocket: Bilateral send and receive, similar to TCP, established in an HTTP send-receive cycle
SPDY: Client-to-service send-receive, optimization of HTTP wire format, multi-responses
WebRTC: Client-to-client send and receive, SCTP with DTLS
CoAP: UDP-based alternative to HTTP in the Internet of Things, client-to-service send-receive, one-to-many send, one-to-many send-receive

2.5 Cloud Interaction Architectures

An architecture integrates transport mechanisms, languages, and high-level interaction patterns into a blueprint for service delivery. Based on the historical evolution of cloud interaction technologies, architectures are discussed in a web-oriented and a service-oriented view, and Figure 2.5 summarizes the considered architectures.


Figure 2.5: Web- and service-oriented view on cloud interaction architectures [226]

2.5.1 Web Architectures

The World Wide Web has evolved from a Web 1.0 of static and non-interactive hypermedia content, created by only a few, to a more interactive and dynamic Web 2.0 with a strong social community aspect in publishing hypermedia content [98]. Personalization based on content semantics, including social, mobile, and location aspects, is considered to be a characteristic of Web 3.0, so the user experience goes beyond merely following hyperlinks [120]. Hypermedia delivery to clients is typically achieved by web applications, and web syndication and web mashups provide means of composition.

Web Application

A web application is a service, hosted on a web server, to deliver websites and multimedia to clients. A website composes multiple web resources, in particular, HTML/XHTML hypertext that is parsed into a DOM by the client, Cascading Style Sheets (CSS) to specify the visual representation of the DOM, multimedia resources, and possibly JavaScript program code for dynamically modifying the DOM during runtime. All resources are located using URLs, and a web user agent, e.g., a web browser, retrieves the resources using HTTP or HTTPS as transport mechanism. In terms of interaction, a traditional web application inherits bilateral send-receive from HTTP. Web applications can also exploit HTTP for media streaming by multi-responses delivery of stream chunks. Hyperlinks use URLs to point to nonlinear continuations of the hypertext or additional multimedia content. A resource's content type is predefined in the hypertext or announced in the HTTP response, and a user agent chooses a module for interpretation during runtime. In other words, messages in the web are dynamically typed. HTTP cookie headers, unique identifiers in URLs, and hidden form fields are then used to track application state across multiple requests, e.g., for authenticated sessions. The fundamental security model of web applications for coarse access control is the same-origin policy on the client side that isolates DOMs, HTTP cookies, JavaScript APIs, and read access to foreign web resources of simultaneously active websites according to their origins. This policy prevents a web browser from dynamically composing a DOM from different origins, e.g., by using JavaScript. Content Security Policy (CSP) [459] enables more fine-grained access control policies for allowing dynamic read access to trusted third-party resources based on content type families. Web Distributed Authoring and Versioning (WebDAV) [116] has been one of the first user collaboration extensions for HTTP, and derived specifications like CalDAV [106] and CardDAV [105] are widely used in services. The Web 2.0 introduces dynamic web applications for improved user experience, and


Asynchronous JavaScript and XML (AJAX) [149] provided the technical means. AJAX enables client-side JavaScript code for dynamically requesting resources over HTTP or HTTPS, so the DOM in the context of the script code can be dynamically updated when client-side events are triggered or the requested content becomes available. HTTP push technology for asynchronous service-initiated interaction typically relies on a client-side AJAX API, e.g., Comet. Furthermore, AJAX refers to XML for updates, but any processable content type in JavaScript is accepted, e.g., JSON. However, the same-origin policy applies to AJAX too, and only resources from the same origin can be requested. Methods to circumvent the same-origin policy restriction are cross-origin resource sharing (CORS) [469] and JSON with padding [193]. Hypertext is natural language oriented, and the Semantic Web [466] is a W3C initiative toward standardization of machine-interpretable data and reasoning, i.e., Unicode and XML for a unified format; the Web Ontology Language (OWL) [460] for expressing relations; and the Resource Description Framework (RDF) [439] as a collection of vocabularies, a metadata data model, and formats for serialization. RDF-compatible standards for application in today's web are RDF through attributes (RDFa) [465] for HTML and XHTML, HTML5-based Microdata [464], and JSON for Linking Data (JSON-LD) [472]. An independent approach to annotation-based semantics is Microformats [261]. The state-of-the-art HTML5 [471] combines an unambiguous hypertext syntax and a set of assisting technologies to resolve ambiguities in former versions that have led to incompatible user agents. HTML5 has been a major driver for the success of the Web 2.0 and its rich web applications because dependencies on third-party plugins, e.g., Adobe Flash, are dissolved. HTML5 includes Microdata for semantics, audio and video formats, offline storage in an App Cache, a File System API, Web Storage and IndexedDB for local storage, Web Workers for concurrency, Web Messaging between active DOMs, SSE and WebSocket for asynchronous interaction, and Geolocation. Popular technologies on the client side that are available in many modern browsers but not yet part of HTML5 are WebGL [207] for 3D rendering support, WebCL [208] for JavaScript access to parallel computing hardware, and WebRTC for client-to-client interaction.

Web Syndication

Interaction in web applications is bilateral between a client and a service, and a client needs push technology to recognize changes. Service-to-client and service-to-service notification are therefore aspects of web syndication. A webhook [239] is a service-side URL that is called by a client or another service to notify a change. This callback principle will eventually lead to an Evented Web [240] for service-to-service send interaction and sophisticated messaging between web applications and cloud services. Also, clients or services can subscribe to a feed or channel to receive updates, and two standards using XML messages are Rich Site Summary (RSS) [364] and Atom [288]. Atom defines an HTTP-based publishing protocol (AtomPub) [168]. Web feeds are multilateral publish-subscribe messaging architectures for one-to-many send interaction, where clients and services subscribe to a publishing service and wait for updates by polling or by asynchronous notification over webhooks or push technology for services or clients, respectively.
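A webhook endpoint can be sketched with the JDK's built-in HTTP server; the port, path, and response code are placeholders chosen for illustration.

```java
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import com.sun.net.httpserver.HttpServer;

// Minimal sketch: a webhook, i.e., a callback URL that another service calls to notify a change.
public class WebhookReceiver {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/webhook", exchange -> {
            try (InputStream body = exchange.getRequestBody()) {
                String notification = new String(body.readAllBytes(), StandardCharsets.UTF_8);
                System.out.println("notification received: " + notification);
            }
            exchange.sendResponseHeaders(204, -1);   // acknowledge without a response body
            exchange.close();
        });
        server.start();
        System.out.println("waiting for notifications on http://localhost:8080/webhook");
    }
}
```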


Web Mashup

Web components are reusable web resources, e.g., multimedia, JavaScript libraries, gadgets [158], and services like Google Maps. The success of the Web 2.0 is also attributed to the mashup principle of composing web components. Web mashups are distinguished into service-side and client-side mashups [367]. While a service-side mashup is a service that gathers the web components and composes an integrated view for clients on the service side, a client-side mashup is a service that returns an HTML markup skeleton and JavaScript code, so web components are gathered and integrated by the client. So-called Open APIs have gained popularity for publishing API specifications of web components, and OpenSocial [318] is an initiative to ease the integration of the Web 3.0 social dimension in web applications. The integration of web components typically involves JavaScript code, and proper encapsulation is required to suppress unintended information flows between web components [244]. Client-side content policies lead to two extreme cases of local script code interaction: no separation or isolation, by direct embedding in the same DOM using a script tag, and strong separation and isolation, by embedding in isolated DOMs using object or iframe HTML elements. Ryck et al. [367] survey the state-of-the-art for more fine-grained controls in mashups, and they distinguish four categories of web component integration restrictions: separation of components and interaction by specified channels, e.g., HTML5 Web Messaging; isolation of JavaScript execution by static checking to enforce an isolation policy; cross-domain communication restricted by CSP, CORS, and AJAX proxy services; and behavior control for policy runtime enforcement through information flow control, object access mediation, and reference monitors. Table 2.2 summarizes the identified service interaction patterns in web architectures.

Table 2.2: Interaction patterns in web architectures [226]

Architecture: Interaction patterns and properties

Web application: HTTP or HTTPS send-receive interaction, multi-responses for audio and video streaming, push technology and AJAX for asynchronous send and receive, dynamically typed
Webhook: Service-to-service send between web applications, callback URLs, for web routing patterns
Web feed: One-to-many send, i.e., publish-subscribe; service-to-client by client polling or push technology; service-to-service by webhooks
Mashup: Multiple simultaneous web interactions

2.5.2 Service Architectures

Communication protocols for enterprise application integration have a strong focus on local-area networks, and they are often not correctly forwarded over the Internet and its middleboxes. Web technology has therefore increased the Internet compatibility of services [8]. Based on the accepted language, services can be distinguished:

• Static typing. An interface definition language (IDL) specifies the accepted language, e.g., an XSD or ASN.1 specification, so a parser or stub code could be automatically generated.

• Dynamic typing. The service interface accepts a family of content types or languages. The interpretation of a message is therefore runtime dependent, and a proper parser is chosen based on the content type or an embedded message specification.


Typing affects how strongly clients and services are coupled. While the web is dynamically typed and therefore loosely coupled, static typing restricts interoperability and flexibility, and the consequence is tighter coupling. RPC, web services, RESTful services, and message-oriented middleware are reviewed as the state of the art in architectures for cloud service delivery.

Remote Procedure Calls

RPC is a bilateral send-receive architecture, where a client calls a network-accessible procedure to get a return value [57]. An RPC service is specified by a transport mechanism, e.g., HTTP, a data serialization format, e.g., ASN.1, and an agreement on addressing and binding of functions. Historically, the Open Network Computing RPC [414] was the first API widely available in all major operating systems. Call and return messages are serialized in XDR and exchanged over TCP and UDP. Distributed objects [427] are an evolution of RPC, and three well-known middleware standards are Microsoft Component Services (COM+) [264], the Common Object Request Broker Architecture (CORBA) [310], and Java Remote Method Invocation (RMI) [322]. All three rely on TCP as transport mechanism and have individual data serialization formats, in particular, DCE/RPC [411] for COM+; the Internet Inter-ORB Protocol (IIOP) [311] for CORBA; and the Java Remote Method Protocol, Oracle Remote Method Invocation, and the CORBA-compatible RMI over IIOP [323] for Java RMI. Interfaces are statically typed, and stub code on the client side abstracts away the network interaction for object allocation, transactions, and garbage collection. The Internet Communications Engine [498] is conceptually similar to CORBA and resorts to web protocols for interaction. XML-RPC [385] serializes call and return messages in XML, uses HTTP as transport mechanism, and a service is identified by its URL. There is no schema for a service interface, and messages are therefore dynamically typed. The XML-RPC Description Language (XRDL) [493] is an IDL attempt to enable stub code generation. JSON-RPC [279] is similar to XML-RPC, but uses JSON for serialization, and the transport mechanism is not restricted. Apache Avro [25], Apache Etch [23], and Apache Thrift [22] have been proposed for high performance cloud backend RPC services. Avro is a serialization format and a transport-agnostic RPC mechanism for big data analysis in Apache Hadoop, where messages are dynamically typed by an attached description in JSON. Etch uses TCP as transport mechanism and specifies an IDL-like service description language and a data serialization format. Messages are statically typed, and stub code for clients and services can be generated from a service description. The Thrift framework by Facebook specifies an IDL that also enables stub code generation for statically typed messages. The framework is kept abstract, but some default procedures are provided, e.g., JSON over TCP or HTTP. Twitter Finagle [420] is based on Thrift and introduces multiplexing of parallel Thrift interactions over a single TCP connection. Similar to ASN.1, Google Protocol Buffers [157] is a specification language for abstract data structures and a binary data serialization format. Transport mechanisms are unrestricted, and stub code for the statically typed messages can be generated. Hessian [81] relies on HTTP as transport mechanism and specifies web-based RPC. The compact binary serialization format allows messages to be dynamically typed. Burlap [80] is basically the same protocol; however, the message serialization is XML based.
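To make the dynamically typed XML-RPC message format concrete, the following sketch posts a call message over HTTP; the endpoint URL and the method name are placeholders in the style of the XML-RPC examples, not a real service.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Minimal sketch: an XML-RPC call message sent over HTTP in send-receive fashion.
public class XmlRpcCall {
    public static void main(String[] args) throws Exception {
        String call =
                "<?xml version=\"1.0\"?>\n" +
                "<methodCall>\n" +
                "  <methodName>examples.getStateName</methodName>\n" +
                "  <params><param><value><i4>41</i4></value></param></params>\n" +
                "</methodCall>";

        HttpURLConnection connection =
                (HttpURLConnection) new URL("http://example.org/rpc").openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "text/xml");
        try (OutputStream out = connection.getOutputStream()) {
            out.write(call.getBytes(StandardCharsets.UTF_8));   // serialized call as request body
        }
        System.out.println("status: " + connection.getResponseCode());
    }
}
```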


SOAP/WS-* Web Services

To overcome complexity problems from shared state in distributed objects and platform restrictions in tightly coupled architectures, web services have been proposed [8, 335]. Web services rely on open standards and self-contained messages for loose coupling. The Simple Object Access Protocol (SOAP) [447] for XML-based messaging is a core technology, and it is extended for reliability, security, transactions, orchestration, and business process aspects by web service standards (WS-*). SOAP supports in theory any transport mechanism, including message-oriented middleware; however, HTTP or HTTPS are recommended in the WS-I Basic Profile [481] for interoperability.

Technologies. The Web Services Description Language (WSDL) [433] is an XML-based language for describing a web service by defining XSD message formats for the available operations at an interface. Message bodies are therefore statically typed. Every interface has a bilateral message exchange pattern, e.g., in-only (client-side send), out-only (client-side receive), or in-out (client-side send-receive). Furthermore, a WSDL defines an interface binding (i.e., transport mechanism) and endpoints (i.e., web service URLs). To discover a WSDL description of a service, Universal Description, Discovery, and Integration (UDDI) [293] specifies a registry that is also published as a web service. Alonso et al. [8] characterize service interaction in four layers: HTTP or HTTPS as transport mechanism, messaging in the SOAP format, meta-protocols for service coordination, and high-level middleware properties for reliability, security, transactions, and orchestration.

Messaging. The XSD SOAP schema defines a root element envelope that holds an optional header and a body for the specified message formats. To transport arbitrary binary content in a SOAP message, small binary contents can be stored directly in the SOAP message as text using Base64 encoding. Larger binary contents can be handled more efficiently by encoding the SOAP message in a container format like Microsoft DIME, SOAP Messages with Attachments (SwA) [440], or the Message Transmission Optimization Mechanism (MTOM) [445]. SOAP messaging is similar to RPC because operations are only available at the WSDL-defined endpoints. Endpoint references and message addressing in WS-Addressing [441] introduce multilateral and routing patterns. An endpoint reference is a URL plus parameters that address an endpoint of a service. Endpoint references for destination or error handling are then referred to in the SOAP header. WS-Eventing [458] and WS-Notification [294] for multilateral interaction on top of WS-Addressing are two competing standards for different publish-subscribe architectures, as shown in Figure 2.6. While the publisher in peer-to-peer publish-subscribe has to handle subscriptions directly, a broker allows addressing an anonymous group of subscribers. HTTPS does not ensure confidentiality when SOAP messages are dynamically routed over multiple services. WS-Security [295] therefore specifies the wsse:Security element in a SOAP header, security tokens, and methods, i.e., XML Encryption [435] and XML Signature [453], for end-to-end encryption and cryptographically signed messages. Further extensions are WS-Trust [306] and WS-SecureConversation [301] to establish a security context for speedier cryptographic key exchange. Reliable SOAP message delivery based on acknowledgments is standardized in WS-Reliability and WS-ReliableMessaging [300].
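A minimal sketch of building such an envelope with the SAAJ API (javax.xml.soap, bundled with older Java SE releases or available as a separate library) is shown below; the namespaces, element names, and the header block are placeholders and do not correspond to a particular WS-* standard.

```java
import javax.xml.namespace.QName;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBody;
import javax.xml.soap.SOAPEnvelope;
import javax.xml.soap.SOAPMessage;

// Minimal sketch: a SOAP envelope with an optional header block and a body element.
public class SoapEnvelopeExample {
    public static void main(String[] args) throws Exception {
        SOAPMessage message = MessageFactory.newInstance().createMessage();
        SOAPEnvelope envelope = message.getSOAPPart().getEnvelope();

        // Placeholder header block; WS-Addressing or WS-Security would add their blocks here.
        envelope.getHeader()
                .addHeaderElement(new QName("http://example.org/headers", "TraceId", "ex"))
                .addTextNode("42");

        // The body carries the payload whose format a WSDL/XSD would statically define.
        SOAPBody body = envelope.getBody();
        body.addBodyElement(new QName("http://example.org/service", "EchoRequest", "svc"))
            .addTextNode("hello");

        message.saveChanges();
        message.writeTo(System.out);   // serializes the complete envelope as XML
    }
}
```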


Figure 2.6: Publish-subscribe architectures for multilateral interaction: (a) broker-based and (b) peer-to-peer publish-subscribe

The definition and exchange of policy assertions, e.g., Quality-of-Service parameters, is standardized in the WS-Policy [448] framework, which is extended by WS-SecurityPolicy [305] for security-centric assertions. On a higher level of interaction, WS-Coordination [299] is a standard for coordinating service activities, e.g., business activities in WS-BusinessActivity [298] and transactions in WS-AtomicTransaction [297]. Standards for composition, orchestration, and modeling of composed web services on a business process level are studied by Beraka et al. [44], and examples are WS-Choreography [446], the Business Process Model and Notation (BPMN) [312], and the Business Process Execution Language (BPEL) [296].

RESTful Services

A key principle of Representational State Transfer (REST) is resource orientation, i.e., abstraction of information as a resource for a unified interface between components [129]. REST defines four rather abstract interface constraints without protocol restrictions; however, REST has been strongly influenced by web technology, where HTTP or HTTPS are the default transport mechanisms. The four interface constraints are: identification of resources using URIs, manipulation of resources through their representations, self-descriptive messages, and hypermedia as the engine of application state. A RESTful service offers a set of resources that are identified by URIs. A representation is a binary sequence of a certain content type that captures the state of a resource. Methods then apply operations on a resource in send-receive fashion, i.e., HTTP method GET to read, PUT to create, POST to update, and DELETE to delete a resource. The advantage of using HTTP methods is the compatibility with existing cache infrastructures in the Internet, so services and clouds can scale easily. Furthermore, the messages in REST, e.g., HTTP requests and responses, are self-descriptive according to their MIME content type and therefore dynamically typed. Metadata of a resource, i.e., HTTP headers, enable content negotiation, checksums, authentication, and access control during interaction [336]. A RESTful service is stateless in an interaction, and contrary to a web application, a client needs to track application state in self-contained send-receive interactions. Finally, REST promotes dynamic discovery of resources using hypertext. A client only needs to know the service entry point, i.e., a bookmark, and hyperlinks in the returned bookmark hypertext present the client with further continuations from its current state.

Technologies. To describe a RESTful service, APIs and entry points are often just explained in natural language, e.g., Open APIs. WSDL Version 2.0 introduces support

for fine-grained HTTP bindings, so a RESTful service could be described in a machine-readable format too. The Web Application Description Language (WADL) [455] is XML based and expresses a set of resources, interrelations, supported HTTP methods, and MIME content types of resources for web applications and RESTful services. For XML-based resource representations, WADL natively supports XSD and Relax NG as schema languages. Dynamic service discovery is a registry of entry bookmarks, and examples are DNS service records [87], web syndication services as bookmark feeds, and the Google API discovery service [160]. Interaction in HTTP-oriented REST is bilateral send-receive. Security in REST is limited to secure transport, i.e., HTTPS, and there is no standardized reliable delivery or transaction handling. Any content type for a representation is valid as long as it is supported by the client. With respect to composition, resources in RESTful services are often used as web components in web mashups. JOpera [332] is a framework for REST composition, and BPEL for REST [333], BPMN for REST [334], and the JavaScript-based S language [63] are examples for workflow orchestration. HTTP-oriented REST is designed to use four HTTP methods for resource operations; however, many clients such as web browsers are restricted to HTTP methods GET and POST. Pautasso et al. [336] therefore distinguish Hi-REST and Lo-REST. While Hi-REST is capable of all four HTTP methods and resources are by default represented in XML, Lo-REST relies on workarounds, e.g., an extra HTTP header or a hidden form field, to deal with restricted HTTP methods.

Constrained RESTful Environments. The Internet of Things is considered to be an environment with restricted computational power and memory [388]. CoAP [389] has therefore been proposed as an HTTP replacement for RESTful services in constrained environments. CoAP and HTTP share concepts, e.g., methods, content types, and URI paths, for compatibility reasons. However, CoAP adds certain features specifically for constrained environments, like separated acknowledgment responses, multicast support for one-to-many send interaction, and service discovery.

Message-Oriented Middleware

Message-oriented middleware (MOM) [104] provides an infrastructure for scalable, flexible, reliable, and loosely coupled inter-process communication between peers, and examples are an enterprise service bus or message queuing in clouds. A message typically has a header and a body and represents some event as a self-contained, autonomous entity. A message queue, e.g., a first-in-first-out queue, allows storing, forwarding, and transforming messages for asynchronous interaction in MOM. Figure 2.7 presents the two approaches to messaging architectures. While broker-based messaging reduces communication complexity, peer-to-peer messaging minimizes delays by a unified component in every peer for coordinating discovery and direct interaction. A peer can then participate as a client, a service, or both [104]. In general, MOM enables sophisticated interaction in a group of peers, ranging from bilateral send, receive, and send-receive exchange to dynamic routing. According to Curry [104], a MOM is characterized by a messaging specification, filtering mechanisms, message transformation methods, and additional properties to increase the overall Quality-of-Service. The messaging specification includes message formats, transport mechanisms, and adapters for interconnecting two MOM systems.


Figure 2.7: Architectures for message-oriented middleware [104]: (a) broker-based messaging and (b) peer-to-peer messaging

Message filtering in a MOM is distinguished into channel-based, subject-based, content-based, and composite systems. In a channel-based system, a client subscribes to a channel for a certain class of events. A subject-based system allows a client to subscribe to messages whose header matches a certain pattern, e.g., the header subject. A content-based system is similar to a subject-based system, but pattern matching is also performed on the message body. A composite system then extends content-based matching to sets or sequences of messages. The middleware can also provide APIs for message transformation, e.g., XML transformation, when messages of varying content types originate from heterogeneous peers. Finally, a MOM can offer features for increasing integrity, reliability, and availability of messaging, e.g., transactions and atomic multicast notification, delivery guarantees using acknowledgments, reliable delivery, message prioritization, load balancing, and clustering for fault tolerance.

Messaging APIs. To hide the technical complexity of messaging, a MOM is typically accessed through an API, and there are several attempts toward system-independent APIs to relax vendor lock-in in heterogeneous messaging systems. Java Message Service (JMS) [321] is a popular transport-agnostic API that offers high-level operations to create, send, receive, and read messages. According to the content type in a message’s header, the message body is dynamically typed. The Open Middleware Agnostic Messaging API (OpenMAMA) [317] is an open-source library for a unified API across multiple messaging infrastructures to ease application development and relax vendor lock-in. A MOM has to provide a so-called OpenMAMA bridge implementation. The RESTful Messaging Service (RestMS) [485] is another web-compatible API specification based on REST principles and using HTTP or HTTPS for transport. RestMS specifies URLs to post or receive dynamically typed messages using MIME content types, e.g., XML and JSON. The API specification also includes bridge profiles for integration of messaging systems, e.g., AMQP.

Proprietary Messaging Solutions. Microsoft MSMQ [265] is a peer-to-peer MOM, where networked queues are hosted in the operating system, and processes can send to or receive from a queue. MSMQ provides routing and prioritization of messages, transactions, delivery guarantees, authentication and encryption, and a type system for message bodies. As transport mechanism, MSMQ uses its own protocol or resorts to COM+. Furthermore, the Microsoft .NET Windows Communication Framework (WCF) supports MSMQ as a transport mechanism for services; however, message bodies are then restricted to XML, binary, or ActiveX format. MSMQ allows bilateral asynchronous interaction and supports IP multicast for multilateral one-to-many send interaction, where multiple queues are addressed. Other proprietary solutions are Microsoft SQL Server Service Broker [267] for a broker architecture, TIBCO Rendezvous [416] for a peer-to-peer architecture, Terracotta Universal Messaging [390], and Oracle Tuxedo Message Queue [320] in the Tuxedo cloud.

Open-Standard Messaging Solutions. AMQP [302] is motivated by the lack of interoperability of proprietary messaging solutions, and it therefore specifies an open transport model, i.e., an agreed wire format, and a semantic queuing model for messaging. A queue in AMQP supports ordering, searching, and transactions. Using a self-contained type system, a message has a dynamically typed body and also a header with a distinguished routing key. A binding relates a queue with a broker-like exchange. A message is sent to an exchange and forwarded to all queues with a matching binding. With respect to message filtering capabilities, there are four types of exchanges: direct, topic, fan-out, and headers exchange. A direct exchange forwards a message to queues, where the binding matches exactly the routing key, i.e., a channel-based system. A topic exchange supports routing key patterns in bindings and therefore enables a subject-based system. A fan-out exchange forwards a message to all queues without filtering, and a headers exchange matches predicate arguments on a message header, i.e., a content-based system. In terms of interaction patterns, AMQP supports one-to-many send to address multiple queues over exchanges, but also send and receive interaction is possible. There exist URI schemes for exchanges (amqp:, amqps:), so brokerage can be offered as a service. AMQP also provides authentication, encryption, transactions, and delivery guarantees.

XMPP [369], originally known as Jabber, has a history in Internet chat applications for instant messaging, presence information, and contact lists. However, XMPP also has middleware properties and supports bilateral send and receive of XML-based messages, where an XMPP service acts as a broker for clients or interacts with other XMPP services in a federated architecture. XMPP can use any bidirectional protocol, e.g., TCP, HTTP, or WebSocket, as a transport. Extensions for XMPP are managed in a community process, e.g., Base64-encoded binary content [370], RPC over XMPP [5], automated service discovery [179], publish-subscribe architectures for one-to-many send interaction [269], addressing for dynamic routing [180], reliability guarantees [270], and end-to-end encryption using S/MIME containers. XMPP messages are XML stanzas with untyped text-based content. For middleware applications, either a custom protocol, e.g., XMPP bits of binary [370] and SOAP [134], or an out-of-band agreement is necessary.

The text-based STOMP [402] is designed for asynchronous send and receive interaction in a broker architecture. A client establishes a session to a broker service, and both start to exchange so-called frames for messages, receipts, and errors. Similar to an HTTP request or response, a frame has a header, including a MIME content type field, and a dynamically typed body according to the content type. STOMP supports transactions and message acknowledgments for reliability. With respect to interaction, STOMP provides bilateral interaction between client and service or multilateral one-to-many send in a broker-based publish-subscribe architecture.

MQTT [307] is an open standard for lightweight messaging in the Internet of Things. Supported transport mechanisms are TCP, TLS, and WebSocket. With respect to interaction, MQTT is intended for broker-based publish-subscribe architectures, i.e., one-to-many send interaction, and a message transports up to 256 megabytes of untyped content. An out-of-band agreement on the content type is therefore required. MQTT uses acknowledgments and retransmissions for reliability but does not support transactions.

DDS [308] is another open standard for machine-to-machine interaction in the Internet of Things and supports a peer-to-peer publish-subscribe architecture, i.e., one-to-many send interaction, with scalability, high throughput, and real-time delivery in mind. Publishers, subscribers, and topics are partitioned into domains. Messages for a certain topic are statically typed according to an IDL specification. When a subscriber requests a certain topic, the publishers in the domain use peer-to-peer interaction for message delivery. DDS provides dynamic discovery for locating peers, negative acknowledgments for reliable delivery, and rich Quality-of-Service policies for transmission over the RTPS [309] wire protocol. Authentication and encryption are not yet part of the official standard [313].

Apache Kafka [27] is a software implementation that specifies a broker-based publish-subscribe architecture, i.e., one-to-many send interaction, for high-performance applications. Messages are sent over TCP, and the Kafka server implementation supports clustering, message replication, and persistent storage for fault tolerance. A producer publishes messages for a certain topic, and consumers subscribe to a topic, i.e., a channel-based system. Moreover, a Kafka cluster manages a partitioned log for every topic, where every partition is an ordered list of already published messages. The partitions are replicated in the cluster, so the ordering of published and consumed messages is guaranteed. Messages are deleted from the log after a certain timespan. A consumer maintains an offset in the ordered list of messages for a subscribed topic, so older messages are still accessible if they have not expired yet. The body of a message is an untyped byte sequence, so an out-of-band agreement on the content type is required.

ZeroMQ [190] is an intelligent socket library that uses network protocols, e.g., TCP, UDP, and IP multicast, to provide various interaction patterns between peers. A process or thread maintains a local message queue that is accessible through a socket. ZeroMQ supports bilateral send, receive, and send-receive interaction; dynamic routing; and publish-subscribe one-to-many send interaction. There is no dedicated broker in ZeroMQ; however, a broker can be implemented using the library. Message bodies are untyped byte sequences, and an out-of-band agreement on content types is necessary. Software examples are NullMQ [241], a WebSocket-based JavaScript library that bridges ZeroMQ into a web browser using a modification of STOMP, and ZeroRPC [114] for web-based RPC using JSON-based MessagePack as dynamically typed data serialization format.
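As an illustration of the brokerless publish-subscribe pattern described above, the following sketch uses the pyzmq bindings; the endpoint addresses and the topic prefix are hypothetical, and in practice publisher and subscriber would run in separate processes (ZeroMQ subscribers may also miss messages sent before their subscription is established).

import zmq

ctx = zmq.Context()

# Publisher side: bind a PUB socket and send a topic-prefixed message.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")                      # hypothetical endpoint
pub.send_string("sensor.temp 23.4")           # topic prefix followed by payload

# Subscriber side: connect a SUB socket and filter by topic prefix.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "sensor.")   # one-to-many send, filtered by prefix
message = sub.recv_string()                       # blocks until a matching message arrives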

Message Queuing as a Service. A MOM broker is a critical component in a messaging infrastructure, and needs for scalability and fault tolerance have motivated message queuing in the cloud. One of the first service offerings was Simple Queue Services (SQS) [11] by Amazon Web Services. SQS provides HTTP and HTTPS access to an internal SOAP/WS-* service stack and supports untyped text messages of limited size. Pull Queues [162] and Push Queues [163] are used for messaging and task distribution in Google’s App Engine. Messages in JSON format are sent or received over a RESTful API for both queue types. For message delivery on the client side, pull queues need to be polled, and push queues use webhooks. Cloud Pub/Sub [161] is a broker-based publish-subscribe service for the Google App Engine or web clients. A publisher sends a JSON-based message of a certain topic to a RESTful API, and the message is stored in subscriber queues, so message delivery can be tracked. Subscribers then poll the API or get notified over a webhook. Azure Queues and Service Bus Queues [262] are Microsoft cloud services. Azure Queues provide a RESTful API for bilateral exchange of untyped text messages between two peers. Multilateral interaction such as broker-based publish-subscribe or dynamic routing is then enabled by Service Bus Queues, where messages carry content types. AMQP brokerage is offered by IronMQ [194], StormMQ [403], and CloudAMQP [94]. CloudMQTT [95] offers message brokerage for complex event processing. A RESTful broker-based publish-subscribe architecture is offered by Rackspace Cloud Queues [347].

Table 2.3: Interaction patterns in service architectures [226]

Architecture: Interaction patterns and properties

XML-RPC: HTTP send-receive, dynamically typed
JSON-RPC: Send-receive, dynamically typed
Apache Avro: Send-receive, dynamically typed
Apache Etch: TCP send-receive, dynamically typed
Apache Thrift: Send-receive, statically typed
Protocol Buffers: Send-receive, statically typed
Hessian, Burlap: HTTP send-receive, dynamically typed
SOAP: Send, receive, send-receive, statically typed
WS-Addressing: Multilateral interaction and routing patterns, e.g., relayed request
WS-Eventing, WS-Notification: One-to-many send or one-from-many receive, e.g., publish-subscribe architectures
REST: Send-receive, i.e., HTTP or CoAP, dynamically typed
JMS: Transport-agnostic API, dynamically typed
OpenMAMA: Middleware-agnostic messaging API
RestMS: REST-based messaging API
JMS-compatible: OpenMQ, IBM Websphere MQ, TIBCO Enterprise Message Service
Proprietary: TIBCO Rendezvous, Oracle Tuxedo, Terracotta Universal Messaging
MSMQ: Peer-to-peer, send, receive, one-to-many send (using network multicast), routing patterns; broker-based: SQL Server Service Broker
AMQP: Broker-like exchanges, send, receive, one-to-many send, dynamically typed
XMPP: Broker-based send and receive; with extensions one-to-many send, i.e., publish-subscribe, untyped or dynamically typed
STOMP: Send and receive, broker-based one-to-many send, i.e., publish-subscribe, dynamically typed
MQTT: Broker-based one-to-many send, untyped
DDS: Peer-to-peer one-to-many send, i.e., publish-subscribe, statically typed
Apache Kafka: Broker-based one-to-many send, i.e., publish-subscribe, untyped
ZeroMQ: Untyped socket abstraction, various patterns

2.5.3 Cloud Computing Aspects

Service interaction patterns in communication protocols and architectures for cloud interaction are summarized in Table 2.1, Table 2.2, and Table 2.3, respectively. For implementation examples, the reader is referred to the survey [226]. For state-of-the-art link layer technologies in cloud datacenters, the reader is referred to Bitar et al. [59]. Transport mechanisms in cloud computing are primarily for bilateral interaction between a client and a service. Multilateral interaction, e.g., provided by UDP over IP multicast, could offer advantages for media applications; however, according to Mitchell [272], today's cloud providers disable multicast because of the increased network load. Therefore, architectures such as SOAP with WS-Addressing, publish-subscribe services, and MOM provide multilateral interaction patterns between a group of peers on a higher level, using transport mechanisms as building blocks.

Today's cloud PaaS and SaaS paradigms primarily focus on web transport mechanisms and architectures because they are compatible with the Internet and reliably forwarded. Examples are HTML5 web applications and mashups, RPC over HTTP, SOAP/WS-* web services, and RESTful web services. Also, a trend toward web-compatible and platform-independent messaging has been observed, e.g., XMPP over HTTP or WebSocket, SOAP over XMPP, MQTT over WebSocket, AMQP over WebSocket, RESTful APIs for message brokers, and NullMQ. XML is a fundamental building block in many technologies, e.g., XHTML websites, AJAX updates in XML, RSS and Atom web feeds, XML-RPC, SOAP/WS-* messaging, resource representation in REST, and XMPP messaging.

Grozev and Buyya [170] and Toosi et al. [417] review the state-of-the-art in inter-cloud computing, which is a natural evolution of the cloud paradigm toward federations of clouds. Messaging over XMPP has been proposed as a foundation for inter-cloud communication [46, 82] and inter-cloud interoperability [47]. SOAP/WS-* web services [78] and AMQP [46] are also considered as candidates for the inter-cloud.

2.6 Monitoring of Client-Cloud Interaction

A monitor or monitoring system observes the behavior of another system to verify correctness properties, e.g., for security, fault analysis, failure recovery, or debugging [111]. This section recalls the components and purpose of a monitor and reviews the state-of-the-art observation points for data acquisition in cloud computing. In accordance with Biskup [58, p. 359] and Schroeder [384], Figure 2.8 illustrates the high-level components of a monitoring system. An observation component senses raw data from an observation point in the system under observation and generates audit data, e.g., events, by preprocessing the raw data. Audit data can be stored in an audit database for offline analysis or processed directly for online analysis. Based on a configured monitoring policy, the audit data, and possibly knowledge from cooperating remote agents, the analysis component generates alerts in case of policy violations. Alerts are forwarded to a reaction component. In accordance with configured reaction templates, alerts are usually reported for logging purposes, e.g., to an administrator. Behavior of the system under observation could also be enforced as a response, e.g., by graceful termination or degradation, recovery procedures, and corrective measures. Finally, the monitoring system may inform cooperating remote agents about an alert.


Figure 2.8: High-level components of a monitoring system (based on [58, p. 359])
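The data flow of Figure 2.8 can be summarized as a small processing pipeline. The following Python sketch is only a conceptual illustration of the component roles (observation, optional audit storage, analysis against a policy, reaction) and not an implementation from this thesis; the usage example and its size limit are hypothetical.

class Monitor:
    """Conceptual monitoring pipeline: observation -> analysis -> reaction."""

    def __init__(self, observe, analyze, react):
        self.observe = observe          # raw data -> audit data (event)
        self.analyze = analyze          # audit data -> alert or None (monitoring policy)
        self.react = react              # alert -> report / response / enforcement
        self.audit_db = []              # optional storage for offline analysis

    def process(self, raw_data):
        event = self.observe(raw_data)
        self.audit_db.append(event)
        alert = self.analyze(event)
        if alert is not None:
            self.react(alert)

# Hypothetical usage: flag observed messages whose size exceeds a policy limit.
monitor = Monitor(observe=lambda raw: {"size": len(raw)},
                  analyze=lambda ev: ev if ev["size"] > 4096 else None,
                  react=lambda alert: print("ALERT:", alert))
monitor.process(b"A" * 8192)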


The fundamental motivation for monitoring according to Plattner and Nievergelt [341] is that software developers have better intuition about certain system state aspects than about implemented code. Historically, execution monitoring [237, 382, 384] has been the first attempt toward debugging, checking correctness, and runtime verification. Today's monitoring use cases also include dependable systems [33], fault tolerance [111], online failure prediction [372], safety in distributed systems [156], Service Level Agreements (SLAs) [97], and web services [201]. Monitoring also has a strong interconnection to security, in particular, intrusion detection [233] and access control [373].

2.6.1 Cloud Monitoring

When a client consumes a cloud service, both enter a relationship, often agreed in a contract, and it is in the interest of both parties to verify their claims. Several aspects of cloud computing strengthen the need for monitoring. First, a client entrusts data to a service and loses governance [32]. Monitoring the service could assure the client that the service behaves as agreed. Furthermore, a cloud provider offers an execution environment that could use black-box components, e.g., software libraries, internal services, and hidden procedures. A client could be forced to use these black-box components but has no means of verifying their internal state, and the consequence is an incomplete model [197]. The cloud's execution environment needs hidden routines to implement elasticity and scalability; however, these hidden distributed computations could lead to randomness and operational nondeterminism in a service's run [197]. Hidden routines and shared computational resources can furthermore lead to cross-tenant information flows that cannot be observed or verified by the client [123]. Finally, a cloud is typically restricted to certain platforms, technologies, or programming languages. These restrictions can result in insufficiencies and motivate workarounds; the consequences could be untreated exceptional states [197]. Given the historical background, the following three high-level goals of monitoring have been identified for cloud monitoring:

• Correct execution. A monitor derives internal and external state of the system under observation from its behavior. States and state transitions of a run are checked with respect to a monitoring policy, e.g., according to a specification or correctness conditions. This enables the monitor to take action if an unexpected state is reached or conditions are violated.

• Dependability. Hardware and software are assumed to be faulty, and unexpected or exceptional states can occur anytime and need to be handled appropriately, so the offered service by the system under observation satisfies dependability requirements. Examples are online failure prediction and intrusion detection.

• Measurement. When service consumption is defined in a contract, e.g., in an SLA, the service quality requirements need to be described in measurable quantities. Monitoring measures and verifies quality metrics with respect to the contract. Actions in case of a violation include automated claims and content negotiation.

Aceto et al. [4] review the state-of-the-art in cloud monitoring, in particular, the landscape of commercial and open-source monitoring platforms and Monitoring-as-a-Service offerings. The surveyed monitoring systems are distributed systems that resort to agents and probes to collect data from facility, hardware, networks, operating systems, applications, middleware, and users. The monitoring systems are furthermore characterized according to properties such as scalability; elasticity; resilience, reliability, and availability; adaptability; timeliness; autonomicity; comprehensiveness, extensibility, and intrusiveness; and accuracy. This section highlights the available observation points for monitoring of client-cloud interaction.

2.6.2 Observation Points

The observation points are categorized according to structural layers for hardware, networks, operating systems, services, middleware, and users in clouds.

Hardware Layer

On-chip sensors in hardware record health information, e.g., performance and activity counters, failures, temperature, voltage, and clock frequency. In clouds, this information is typically available to the cloud provider and needs to be collected by a software agent. However, when a client has direct hardware access, e.g., the graphics processors in Amazon EC2 for parallel computing, some performance counters can be read. Mobile devices also have sensors for device motion [12] and geographical position [159] that could be exposed as client-side hardware sensors by an agent.

Network Layer

Monitoring of network communication on the client or service side is a popular technique when the system under observation is considered a black-box component. Section 2.4 already studied protocol stacks and architectures for client-cloud network communication, and observation points are located on the different layers of the protocol stack. Packet delivery in IP and IPv6 networks is based on the IP packet header, i.e., the destination address, and a router should leave the payload untouched. However, an increasing number of network devices also inspect packet payloads for routing and filtering decisions, which is generally referred to as Deep Packet Inspection (DPI) [43].

Simple Network Management Protocol. The UDP-based Simple Network Management Protocol (SNMP) [175] has a long history in network monitoring. An SNMP architecture distinguishes a manager and network devices that utilize agents. An agent maintains variables, i.e., Object Identifiers (OIDs), for device-specific network measurements and configuration settings. Variables are organized in a hierarchical namespace referred to as Management Information Base (MIB). A manager then gets or sets OIDs on network devices using SNMP, and measurements can serve as observation point [413].
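As an illustration, reading a single MIB variable from a device could look as follows; the sketch assumes the synchronous high-level API of the pysnmp library (version 4.x), an SNMPv2c community string, and a hypothetical device address.

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

# Query sysDescr.0 (SNMPv2-MIB) from a hypothetical device at 192.0.2.1.
error_indication, error_status, error_index, var_binds = next(getCmd(
    SnmpEngine(),
    CommunityData('public'),                     # SNMPv2c community string
    UdpTransportTarget(('192.0.2.1', 161)),      # agent address and SNMP port
    ContextData(),
    ObjectType(ObjectIdentity('SNMPv2-MIB', 'sysDescr', 0))))

if error_indication is None and not error_status:
    for oid, value in var_binds:
        print(oid.prettyPrint(), '=', value.prettyPrint())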

Packet Headers. Monitoring of IP packet header information, including a view on transport protocol headers, e.g., TCP segment headers, UDP datagram headers, and Internet Control Message Protocol (ICMP) messages, can reveal inter-process communication behavior. The communicated payloads are left untouched. Examples are network firewalls that implement an access policy based on IP addresses and TCP and UDP ports of services.


Packet Payload. DPI technology for packet payload observation collects data from packets beyond protocol headers. The payload of an IP packet is a fragment of a higher-layer protocol, e.g., a TCP segment or UDP datagram. Actual messages or byte streams can, however, span multiple packets. This view is limited for encrypted protocols, e.g., SSL, TLS, and DTLS; in this case, only the negotiation of encryption parameters is observable in clear text. Payload-based inspection can be further distinguished into:

• Packet-level payload inspection. Each observed packet payload is considered an independent event, and conversations on higher-layer protocols are ignored for the sake of simplicity.

• Reassembling payload inspection. Higher-layer protocol messages or streams are reconstructed by the monitor by reassembling packet payloads from TCP segments or fragmented UDP datagrams. However, characteristics of higher-layer protocols need to be correctly implemented, e.g., TCP allows out-of-order segments and retransmissions [487].

Network Flows. Another approach to network observation is to collect network flows, and techniques have been reviewed by Čeleda and Krmíček [423]. A network flow is a record that summarizes a uni- or bidirectional flow of packets that share source and destination address, transport protocol, and port information [486]. For example, a TCP session is reduced to one bidirectional flow or two unidirectional flows, depending on the actual technique, and measurements include a timestamp, the number of packets, and the total number of bytes in the flow. As payload data is not stored in a flow, fewer privacy concerns are raised, and encrypted sessions can also be recorded. Two notable standards are Cisco NetFlow [90] and IPFIX [183]. A network device or software agent exports network flows to a so-called collector for further analysis.
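The aggregation step can be sketched in a few lines of Python: packets sharing the usual 5-tuple key are folded into one flow record with packet, byte, and timestamp counters. The record fields are a simplified assumption and not the exact NetFlow or IPFIX schema.

from collections import defaultdict

def aggregate_flows(packets):
    """packets: iterable of dicts with src, dst, proto, sport, dport, ts, size."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for p in packets:
        key = (p["src"], p["dst"], p["proto"], p["sport"], p["dport"])  # unidirectional 5-tuple
        f = flows[key]
        f["packets"] += 1
        f["bytes"] += p["size"]
        f["first"] = p["ts"] if f["first"] is None else min(f["first"], p["ts"])
        f["last"] = p["ts"] if f["last"] is None else max(f["last"], p["ts"])
    return flows

flows = aggregate_flows([
    {"src": "10.0.0.1", "dst": "192.0.2.8", "proto": "TCP", "sport": 40001, "dport": 443, "ts": 1.0, "size": 60},
    {"src": "10.0.0.1", "dst": "192.0.2.8", "proto": "TCP", "sport": 40001, "dport": 443, "ts": 1.2, "size": 1500},
])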

Circuit-Level Gateway. The concept of a circuit-level gateway has been proposed as a transport-layer policy enforcement point in a network [410], and today, Sockets Secure (SOCKS) [235] is the most popular standard. A client first establishes a session with the gateway, so the gateway can exchange TCP segments and UDP datagrams on behalf of the client and independent from application protocols. The circuit-level gateway has an unadulterated view of bidirectional TCP byte streams or UDP datagrams; however, client configuration is necessary.

Operating System Layer

While cloud platforms and services are run on operating systems under control of the service provider, IaaS clouds offer virtualization, so clients can run individual operating systems [32]. Observation points in the operating system are potential data sources for client- and service-side monitoring.

Logs and Audit Trails. Operating systems maintain log files to store errors, failures, notifications, and debugging events in chronological order for administrative purposes. A log record includes a timestamp, location, type and classification, and a log message. An example is Syslog [152], which specifies a record format and also a protocol for network propagation. Also, an explicit audit trail for security-relevant events can be provided by the system under observation, e.g., access logs, the Linux audit framework [319], and the Basic Security Module (BSM) [418].
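For instance, an application-level agent can emit audit records to a local or remote Syslog collector with Python's standard logging module; the collector address and the log message below are hypothetical.

import logging
import logging.handlers

logger = logging.getLogger("auth-audit")
logger.setLevel(logging.INFO)

# Forward records to a Syslog daemon listening on UDP port 514 (hypothetical host).
handler = logging.handlers.SysLogHandler(address=("logs.example.org", 514))
logger.addHandler(handler)

logger.warning("failed login for user alice from 192.0.2.7")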

System Activity. Operating systems record system activity properties of the kernel, running processes, and interfaces, e.g., processor utilization, system load, network interface statistics, and memory utilization. Furthermore, many of these metrics are collected and individually exposed for every process, e.g., the /proc and /sys file systems in Linux.

System Calls. The interface between processes and the kernel of an operating system is a special software interrupt referred to as system call, e.g., to open a file or fork a process. The motivation for system call monitoring in Linux and Unix systems is that the behavior of a process reflects in an ordering of system calls [145], e.g., for dynamic analysis of malware [119]. For auditing and debugging, a modern operating system is usually capable of exposing system calls, e.g., strace [234] and dtrace [167].

Call Stack Inspection. A modern operating system isolates processes from each other, and every process has an individual call stack of activation frames. A frame stores function-local variables and the program position, and a function call pushes a new frame onto the stack for later continuation. Variable values and the ordering of frames on the stack are a potential observation point, and a kernel modification for call stack inspection has been proposed by Feng et al. [122].

API Hooking. The Microsoft Windows API is a set of operating system functions that are conceptually comparable to system calls in Linux and Unix [119]. A so-called API hook redirects an API function call to some handler for monitoring of calls, arguments, and behavior enforcement. Examples for API hooking also include dynamic malware analysis [119].

Virtual Machine Introspection. Techniques like API hooking, call stack inspection, or a modified kernel interfere with the operating system and could be identified by a running process. This problem especially exists in behavioral analysis of potentially malicious software, where malware changes its behavior when a monitor has been detected [119]. Virtual machine introspection is an approach toward resilient monitoring. The target operating system runs in a virtual machine, e.g., in a cloud, and a virtual machine monitor provides access to the virtualized hardware. A monitor then needs to reconstruct the state of the operating system and processes from activity in the virtualized hardware [146, 282].

Service Layer

Based on the cloud delivery model, applications that provide a service are implemented by the provider (i.e., SaaS) or the client (i.e., PaaS). Observation points for the behavior of applications and services are therefore considered. Applications often maintain an event log for auditing and debugging which could serve as observation point, e.g., transaction logs, performance logs, and event notifications.


Code Instrumentation. Code instrumentation gathers data such as function call traces, conditions, and variable values directly from an application or service during runtime. Techniques for code instrumentation are distinguished into manual instrumentation, where a developer explicitly adds pre- and post-conditions that emit the information during runtime, and automated code instrumentation. Automated code instrumentation can be further distinguished into bytecode instrumentation [18, 56, 285], code rewriting [284], and aspect-oriented programming by inserting advice statements [289, 290, 354]. While static code instrumentation inserts the monitoring statements at compile time, dynamic code instrumentation rewrites the code before or during interpretation. For example, Magazinius et al. [246] present a dynamically inserted reference monitor for JavaScript code. Historically, code instrumentation has been used for debugging and profiling, and monitoring use cases are information-flow security [368] and control-flow integrity enforcement [2].
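Manual instrumentation can be as simple as wrapping functions so that entry and exit events are emitted at runtime; the following Python decorator is a generic sketch of this idea and is not tied to any of the cited instrumentation frameworks, and the wrapped application function is hypothetical.

import functools
import time

def instrument(func):
    """Emit an event before and after every call of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        print(f"CALL {func.__name__} args={args!r} kwargs={kwargs!r}")   # entry event
        result = func(*args, **kwargs)
        print(f"RETURN {func.__name__} value={result!r} "
              f"after {time.monotonic() - start:.6f}s")                  # exit event
        return result
    return wrapper

@instrument
def checkout(order_id, amount):     # hypothetical application function
    return {"order": order_id, "charged": amount}

checkout(42, 19.99)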

Middleware Layer

The middleware layer considers message exchange and architectures that enable clients and services to interact. Technologies have already been discussed in previous sections. A middleware implementation, e.g., for SOAP/WS-* web services or MOM, typically supports logging for auditing purposes, e.g., audit frameworks [482], message flow logs [375], and brokerage logs [483].

RPC Filter. In RPC-based architectures for client-cloud interaction, the procedure call and return arguments are serialized in an agreed format, and a web-compatible transport mechanism is used for interaction. An RPC filter is then similar to a function hook that intercepts calls and returns.

Message Queue Filter. Message queues are found at the client and the service. A message queue filter [478] is a software component that observes messages in a queue. Furthermore, in broker-based messaging, the broker receives incoming messages at a certain endpoint, and a monitor could be integrated by replicating messages, e.g., by adding a blind-copy endpoint reference in WS-Addressing for SOAP/WS-* web services.

Web Proxy. For caching and content filtering in web architectures, a web proxy forwards HTTP requests and responses between a client and a service [245]. Web proxies can be distinguished into forward and reverse web proxies [407]. A forward proxy establishes interaction with a web application on behalf of a client, and a reverse proxy accepts interaction on behalf of a web application or RESTful web service. Both types of web proxies observe complete HTTP requests and responses, and a proxy can also enforce actions. However, a web proxy cannot observe encrypted sessions and either just blocks or forwards encrypted data. An X.509 certificate for SSL/TLS in HTTPS is issued for a certain domain name, so a client can verify the identity of a service based on the domain name. A forward proxy would need a fake X.509 certificate or influence the encryption handshake to gain access to the clear-text requests and responses [245].

Suffix Proxy. A suffix proxy [245] is similar to a forward web proxy in that it interacts with a service on behalf of a client; however, a suffix proxy supports inspection of HTTPS traffic because two independent SSL/TLS sessions are established, i.e., client-to-proxy and proxy-to-service. There is no need for explicit configuration in the client. A suffix proxy is in fact a service for a certain DNS domain, e.g., suffix.org, so requests and responses are transparently forwarded to web applications identified by the prefix of the domain. For example, HTTP requests to google.com.suffix.org are forwarded by the proxy to google.com. However, a suffix proxy needs to adapt all forwarded requests and responses to update hyperlinks.
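The host mapping and link rewriting performed by a suffix proxy can be sketched as simple string transformations; the proxy domain suffix.org is the hypothetical example already used above.

from urllib.parse import urlsplit, urlunsplit

SUFFIX = ".suffix.org"   # hypothetical proxy domain from the example above

def origin_host(request_host: str) -> str:
    # google.com.suffix.org -> google.com (host the proxy forwards to)
    if not request_host.endswith(SUFFIX):
        raise ValueError("not a suffix-proxy host")
    return request_host[:-len(SUFFIX)]

def rewrite_link(url: str) -> str:
    # Rewrite absolute links in responses so they keep pointing to the proxy.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc + SUFFIX,
                       parts.path, parts.query, parts.fragment))

assert origin_host("google.com.suffix.org") == "google.com"
assert rewrite_link("https://google.com/search?q=x") == "https://google.com.suffix.org/search?q=x"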

Integrator. In a client- or service-side web mashup, an integrator composes web components by using JavaScript, and it can be seen as a middleware [120]. By instrumenting JavaScript code, an integrator can add monitoring statements for integration-driven monitoring, e.g., an observation point for HTML5 Web Messaging between isolated web components or inline JavaScript security monitors [245].

User Layer

Monitoring the operations and behavior of users is another observation point, in particular, for cloud services that exhibit social aspects. PaaS and SaaS implementations often provide logs for auditing or service adaptation based on user preferences, e.g., authentication, access, geolocation, and user history logs. Actively emulating user behavior in web architectures is primarily concerned with monitoring of availability and performance of a service and is referred to as synthetic user monitoring [103]. However, the emulation cannot provide insights into behaviors that occur when many real users interact with a service. Observing the behavior of real users of a service, e.g., by monitoring whether requests are served in time, is referred to as real user monitoring [103]. This procedure collects data passively from interacting users. Examples are JavaScript injection for client-side monitoring, client-side agents, and user-centric network observation.

2.7 Implications of Technological Trends

Today's technologies for client-cloud interaction promote higher-level messaging between clients and services; however, services are often black-box components and cannot be monitored directly by the client [249]. At the same time, the LangSec threat model requires that complete messages be observed for accurate monitoring. Observation points on the service, network, and middleware layer are natural choices for observing the exchanged messages, but several trends indicate difficulties for network monitoring, especially for DPI technology.

Multiplexing in transport mechanisms between client and service aims to minimize latency in connection establishment when multiple parallel interactions take place, e.g., SPDY and HTTP/2. To monitor multiple parallel message transmissions, sent in a single TCP connection, a DPI system needs to demultiplex and reassemble individual messages. Also, DPI observation of network traffic is restricted to a physical domain, and mobile devices can have multiple active links outside of the monitored physical domain, e.g., 3G/4G networks. Multihoming transport protocols like MPTCP in mobile devices [28] utilize available links for increased service availability, and a DPI monitor therefore has an incomplete view.

There is also a trend toward pervasive encryption in transport mechanisms, so network interaction cannot be monitored, e.g., SSL/TLS-secured HTTPS, SPDY, MPTCP, and QUIC. A DPI monitor would need to break the encryption, which is hard, or introduce forged X.509 certificates in a man-in-the-middle attack. So far, forging of certificates has worked by compromising or cooperating with a certificate authority. These authorities are usually shipped with client software and implicitly trusted, so issued certificates for services are automatically trusted too. However, there is an increasing distrust in certificate authorities as a consequence of misuse cases [188], and an observed trend is so-called certificate pinning [326]. Additional information like service-specific restrictions on trustworthy certificate authorities or fingerprints of expected public keys in certificates are shipped with clients, so man-in-the-middle attacks by DPI monitors can be detected. Examples are Google Chrome [408] and reverse engineering protections in modern mobile phone apps [324].

Consequently, DPI technology is not future-proof for monitoring the language in an actual message exchange. With respect to the proposed security monitor in this thesis, observation points need to be situated in the client or service software or on the middleware layer, e.g., in a web or suffix proxy.


Chapter 3

Literature Review

This chapter studies the state-of-the-art with respect to the language-based anomaly detection approach. First, anomaly-based intrusion detection is reviewed in Section 3.1, including the language-theoretic view on intrusion detection and resulting evasion methods. The section furthermore investigates related work on anomaly detection for tree-structured data in greater detail. XML and schema languages are introduced in Section 3.2 in a nutshell. XML attacks are the motivation of this research, in particular, the signature wrapping attack, and Section 3.3 therefore summarizes known attacks and the limitations of countermeasures. Section 3.4 reviews stream validation as a countermeasure against attacks and introduces the notion of visibly pushdown automata for XML. Learning a language representation is related to schema inference, and the state-of-the-art is therefore introduced in Section 3.5. To close the chapter, Section 3.6 presents related work on wrapper inference for information extraction.

3.1 Anomaly-Based Intrusion Detection

Scarfone and Mell [378] define an intrusion detection system (IDS) as a system that monitors another system or network for symptoms or evidence of violated security constraints, i.e., attacks. Intrusion detection capabilities are present in many security products today, e.g., intrusion detection and prevention systems (IDPSs and IPSs), next-generation firewalls, unified threat management appliances, web application firewalls, web and service security gateways, security information and event management (SIEM), sandbox appliances, and breach detection systems. From a historical point of view, IDS types are distinguished according to the location of the monitor into host-based and network-based systems. Furthermore, a detection technique for intrusion detection is categorized as misuse or anomaly based.

• Misuse-based intrusion detection is also called knowledge-based intrusion detection because observations are matched for patterns or signatures of well-known attacks, e.g., tokens, patterns, and regular expressions.

• Anomaly-based intrusion detection evaluates observations in a reference model of normal behavior, e.g., a finite state machine, stochastic model, or machine learning profile, and an attack is assumed to deviate from the reference model.

Snort [362] is a well-known misuse-based IDS for network monitoring. A Snort attack signature describes a network-based attack using tokens and regular expressions, and IP packet payloads, UDP datagrams, or reassembled byte streams in TCP connections are matched. Suricata [316] is an alternative IDS supporting Snort signatures. Bro [337] is a network monitoring framework for event processing and is capable of byte stream reassembling in TCP connections, dynamic application protocol detection, and protocol parsing. The framework includes a policy scripting language, so events and actions can be specified by, e.g., pattern matching in protocol fields and payloads. Bro is often called a network-based IDS; however, it is more powerful because the policy language is Turing-complete.

For anomaly detection, there are two approaches to finding a reference model: constructing it from a specification or learning it from previous observations. A learning approach requires some form of training to construct a model, and detection techniques based on a specification are often treated as the independent category of specification-based anomaly detection because no training is required [378].

Misuse- and anomaly-based intrusion detection have different advantages and drawbacks. From a conceptual view, misuse detection accurately identifies already known attacks while the number of false alarms is kept to a minimum; however, it misses attacks that are not yet publicly disclosed (i.e., zero-day attacks) or deliberately evade the detection technique (e.g., targeted attacks). Bilge and Dumitras [55] analyze the prevalence of zero-day attacks in malware, and their results indicate that zero-day attacks have become a frequent threat. While anomaly detection can, in theory, identify zero-day attacks, several problems that challenge the practical applicability arise [392]:

• representativeness and quality of training data

• unstable normality because of system updates and evolution (i.e., concept drift)

• sensitivity to false alarms when training data is incomplete

• the semantic gap, i.e., whether an anomaly is actually an attack

• understandability of anomalies by a human operator

Nonetheless, the theoretical capability of identifying zero-day attacks is a strong motivator for research in anomaly-based intrusion detection. The first anomaly-based IDS, proposed by Denning [112], has six components, i.e., subjects, objects, audit records, profiles, activity rules, and anomaly records. The behavior of subjects with respect to objects is stored in statistical profiles that are instantiated from predefined templates and automatically updated during runtime according to activity rules. The IDS monitors audit trails of an operating system, matches audit records to profiles, initiates associated activities of matching profiles, and eventually generates anomaly records.

Notable surveys on intrusion detection are given by Debar et al. [109], Lazarevic et al. [233], and Scarfone and Mell [378]. Modi et al. [276] provide a more recent survey of intrusion detection techniques in cloud computing that also discusses IDS and IPS architectures for the provider side. Anomaly detection in general is studied by Chandola et al. [84] because identifying deviating observations has many application domains, e.g., credit card fraud detection. An observable instance (i.e., event) in this study has a set of attributes, and events may have relationships such as an ordering in a time series or graph-structured data. Based on relationships between events, the survey distinguishes three types of anomalies: point, contextual, and collective anomalies.


Techniques for point anomaly detection are classification, nearest neighbor, clustering, statistics, information-theoretical measures, and spectral detection. For contextual and collective anomaly detection, techniques either reduce to a point anomaly detection problem, e.g., by embedding events in a vector space, or utilize structure, e.g., sequence modeling using automata and Markovian techniques. In follow-up research, Chandola et al. [85] focus on techniques for anomaly detection in discrete sequential data. They distinguish three detection scenarios: an anomaly in a sequence, an anomalous subsequence in a sequence, and an anomalous frequency of a pattern in a sequence. Techniques are then reviewed and distinguished into sequence based, e.g., Markovian techniques and hidden Markov models; continuous subsequence based, e.g., window scoring; and pattern based, e.g., substring matching.

3.1.1 Network Layer Anomaly Detection

Network anomaly detection focuses on network packets and their payloads, e.g., IP headers, TCP segments, and UDP datagrams. Higher-layer application protocols are not assumed; however, all of the reviewed techniques have been evaluated in a web context. The eBayes TCP [422] system performs specification-based anomaly detection by monitoring network packet headers. The reference model is specified as a Bayes network approximation of expected TCP connections, and during operation, this Bayes network is continuously updated. Two notable systems for learning-based anomaly detection in networks are Packet-Header Anomaly Detection (PHAD) [251] and Application-Layer Anomaly Detection (ALAD) [250]. Both systems update nonstationary probabilistic models of predefined network traffic properties, i.e., packet header properties in PHAD and keyword frequencies in packet payloads in ALAD. Zanero and Savaresi [497] present a two-stage approach for learning-based anomaly detection. In the first stage, header data is extracted and an unsupervised self-organizing map (SOM) classifies packet payloads. The second stage then analyzes a sliding temporal window over decoded header data and associated payload classes, and a second SOM approximates inter-packet relationships.

Wang and Stolfo [480] introduce the payload-based anomaly detector (PAYL). For a destination port and payload length, PAYL learns a payload profile in terms of a statistical histogram over payload bytes. Mean and standard deviation of a profile then allow identifying anomalous packet payloads. While PAYL models 1-grams of payload bytes, Anagram [479] stores distinct payload n-grams (substrings of length n, e.g., 2-grams(abcd) = (ab, bc, cd)) in a Bloom filter during training and calculates the ratio of unknown n-grams to total n-grams in an observed packet as an anomaly score. PAYL computes a profile for packets of the same length. Bolzoni et al. [61] extend PAYL toward POSEIDON, which learns PAYL-like profiles for payload classes instead of packet lengths, where an unsupervised machine learning algorithm performs the classification. McPAD [338] is a multi-classifier system that resorts to an ensemble of one-class Support Vector Machines (SVMs) to increase detection accuracy. Informally, an SVM is a geometric large-margin classifier, where an optimal hyperplane for separating a vector space is learned by maximizing the margin between the two classes of examples. One-class SVMs are particularly interesting for anomaly detection because only one class of examples is assumed. In McPAD, the one-class SVMs are trained from different vector-space embeddings of training data using various kinds of n-grams. HMMPayl [31] is another multi-classifier system that models the ordering of payload bytes in multiple hidden Markov models.

Network flow monitoring is claimed to be more privacy-friendly than DPI monitoring because only metadata about transport protocol interaction is collected, and Sperotto et al. [399] argue that flow-based detection should be seen as complementary to packet inspection. The flow-based anomaly detection approach by Lakhina et al. [224] identifies attacks or failures by observing the distributions of certain network traffic features. Changes in flow entropy can correlate with violating activity, e.g., Wagner and Plattner [475] propose Kolmogorov complexity to detect the spread of Internet worms, Tellenbach et al. [406] investigate a traffic entropy spectrum to generalize entropy metrics, several entropy measures have been evaluated by Nychis et al. [291], and Winter et al. [486] provide an algorithm based on exponential smoothing to detect anomalies in an entropy time series. Sperotto and Pras [398] propose time-series analysis using a hidden Markov model to identify dictionary attacks on Secure Shell (SSH) services.
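The byte-level models of PAYL and Anagram reduce to simple computations over payload bytes. The following Python sketch illustrates a 1-gram histogram profile and an unknown-n-gram ratio; it uses a plain set instead of a Bloom filter, ignores the per-port and per-length bucketing of the original systems, and the training payloads are hypothetical.

from collections import Counter

def ngrams(payload: bytes, n: int):
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]

def byte_histogram(payload: bytes):
    """PAYL-style 1-gram profile: relative frequency of every byte value."""
    counts = Counter(payload)
    total = max(len(payload), 1)
    return [counts.get(b, 0) / total for b in range(256)]

def anagram_score(payload: bytes, known: set, n: int = 3):
    """Anagram-style score: ratio of previously unseen n-grams to all n-grams."""
    grams = ngrams(payload, n)
    unseen = sum(1 for g in grams if g not in known)
    return unseen / max(len(grams), 1)

training = [b"GET /index.html HTTP/1.1", b"GET /style.css HTTP/1.1"]
known = {g for p in training for g in ngrams(p, 3)}
print(anagram_score(b"GET /index.html HTTP/1.1", known))        # low score: mostly known n-grams
print(anagram_score(b"GET /../../etc/passwd HTTP/1.1", known))  # higher score: many unseen n-grams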

3.1.2 Operating System Layer Anomaly Detection

Audit trail and system call monitoring have a tradition in intrusion detection; however, considerable problems arise when traces from various origins are interleaved, e.g., multi-threaded execution and procedures in linked libraries. With respect to specification-based anomaly detection, Ko et al. [212] present a program policy specification language to formalize security properties of a program, and a Unix tool to monitor audit trails for conformance. Follow-up research extends this approach to distributed systems [213]. Wagner and Dean [476] propose techniques to extract a reference model from program code in terms of a finite-state machine over system calls; however, the bisimulation during runtime is computationally intensive, and simplifications are necessary. The Janus framework [145] resorts to a kernel module in a Linux operating system for intercepting system calls, querying a policy engine, and eventually blocking calls. Experiences indicate practical problems, e.g., race conditions, indirect resource paths, incorrect state replication, and unpredictable side effects if a system call is blocked.

Forrest et al. [135, 182] present a host-based IDS that learns a reference model for anomaly detection from n-grams of system call traces, and follow-up research introduces call delays and blocking [391]. Also, several techniques resort to automaton learning from strace system call traces [260, 387]. VtPath [122] uses a modified operating system kernel to directly monitor the call stack and its return addresses of a process to learn execution paths between program execution points. Mutz et al. [281] further extend the system call monitoring approach by combining multiple types of reference models for increased accuracy and including system call arguments in the models. Other techniques that consider system call arguments are presented by Maggi et al. [247] and Frossi et al. [140]; both learn stochastic models from system call traces as reference models for anomaly detection.
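A minimal version of the n-gram approach over system call traces looks as follows; the window length and the example traces are arbitrary choices for illustration and not parameters from the cited work.

def train_ngrams(traces, n=5):
    """Collect all sliding windows of length n over normal system call traces."""
    normal = set()
    for trace in traces:                      # trace: list of system call names
        for i in range(len(trace) - n + 1):
            normal.add(tuple(trace[i:i + n]))
    return normal

def mismatch_count(trace, normal, n=5):
    """Number of windows in a new trace that were never seen during training."""
    windows = [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]
    return sum(1 for w in windows if w not in normal)

normal_traces = [["open", "read", "read", "write", "close"] * 4]
profile = train_ngrams(normal_traces, n=3)
suspicious = ["open", "read", "mmap", "mprotect", "execve"]
print(mismatch_count(suspicious, profile, n=3))   # > 0 signals an anomalous call ordering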

3.1.3 Service, Middleware, and User Layer Anomaly Detection

One of the first approaches that highlights the importance of application layer protocol language is the service-specific anomaly detection approach by Kruegel et al. [217]; the proposed system learns protocol-specific distributions of segments of bytes in protocol messages.


Model-based anomaly detection for web applications, deployed in a web proxy, has been introduced by Kruegel and Vigna [218]. The system observes HTTP interaction and extracts six anomaly detection measures from HTTP headers (e.g., character distribution and attribute lengths) to detect attacks that affect header fields, e.g., in query parameters of a requested URI path. Follow-up research includes inferring attack types using heuristics [361], leveraging concept drift in web applications [248], and querying a knowledge base when training data is scarce [360]. Masibty [102] operates as a reverse proxy in front of a web application and additionally monitors SQL database calls performed by the web application. Both headers and content in web requests and responses are analyzed by multiple anomaly detection modules, and Masibty correlates web requests, triggered SQL queries, and web responses to identify violating return values. Kirchner [209] and Krüger et al. [219] present two approaches that also operate as a web reverse proxy. Both techniques extract features from web requests and responses; while Kirchner’s approach resorts to a nearest neighbor technique for anomaly detection, the TocDoc system by Krüger et al. implements an ensemble of anomaly detection techniques and supports automatic healing of web requests. In follow-up research, Krüger et al. [220] resort to vector space methods, i.e., matrix factorization, to automatically extract sections that co-occur in network payloads, and use cases include anomaly detection.

Several network-based anomaly detection techniques are strictly tailored to HTTP as application protocol, where web requests and responses are analyzed for anomalies. The technique proposed by Ingham et al. [191] tokenizes request headers and learns a finite automaton as reference model. Spectrogram [395] resorts to multiple Markov chains for analyzing the URI-path structure in an HTTP GET request and the payload in an HTTP POST request. Multiple hidden Markov models in HMM-Web [99] analyze the sequential structure of URI paths in web requests. In our own previous research [228, 229], we propose analysis of web request URI paths using an online learning variable-order Markov model that continuously adapts to the observed service. Boggs et al. [60] present a distributed anomaly detection system, where so-called content anomaly detectors are distributed over varying websites to detect wide-spread simultaneous attack activity. Detectors exchange low-confidence identifications for a collaborative decision to increase accuracy and reduce false alarms, and every detector implements the n-gram-based technique from Anagram [479].

For monitoring the internal state of a web application, Cova et al. [100] present Swaddler. The system automatically instruments web application code when executed in the PHP interpreter by inserting monitoring statements, and events are forwarded to an analyzer component that constructs profiles, e.g., for variable values, using various anomaly detection techniques. Xie and Yu [490, 491] propose monitoring of user browsing behavior to protect web applications from distributed Denial-of-Service attacks. Web request headers provide resource descriptions, i.e., URI paths and content types, and references that identify users. The normal browsing behavior of users in terms of sequential resource accesses is then modeled in a hidden semi-Markov model.
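To illustrate the kind of per-parameter measures used by Kruegel and Vigna, the following sketch learns length statistics and a character distribution for one query parameter from normal requests and scores new values; the thresholding, the Chebyshev-style length test, and the example parameter values are simplified stand-ins for the original models, not a reproduction of them.

import statistics
from collections import Counter

class ParameterModel:
    """Toy model of one HTTP query parameter: value length and character distribution."""

    def __init__(self, training_values):
        lengths = [len(v) for v in training_values]
        self.mean = statistics.mean(lengths)
        self.variance = statistics.pvariance(lengths) or 1e-9
        chars = Counter("".join(training_values))
        total = sum(chars.values())
        self.char_freq = {c: n / total for c, n in chars.items()}

    def length_score(self, value):
        # Chebyshev-style bound: large deviations from the learned mean get low scores.
        d = (len(value) - self.mean) ** 2
        return 1.0 if d <= self.variance else self.variance / d

    def char_score(self, value):
        # Fraction of characters that were observed at all during training.
        return sum(self.char_freq.get(c, 0.0) > 0 for c in value) / max(len(value), 1)

model = ParameterModel(["alice", "bob", "carol"])            # normal values of a "user" parameter
print(model.length_score("mallory"), model.char_score("mallory"))
print(model.length_score("' OR '1'='1"), model.char_score("' OR '1'='1"))   # injection-like value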

3.1.4 Language-Theoretic View on Intrusion Detection

The LangSec view [69, 376] explains practical problems encountered in intrusion detection when there is an expressiveness gap between the anomaly or misuse detection model and the audit data, e.g., network packets, HTTP requests, protocol messages, and system and API call traces. An IDS in fact monitors the language in an interaction or emitted by runtime behavior and decides MEMBERSHIP. The true power of the observed language could be Turing-complete in the worst case (e.g., a JavaScript program), but an IDS needs to decide MEMBERSHIP efficiently. Analysis techniques for intrusion detection are usually language approximations, and the consequences are errors. Figure 3.1 shows the types of errors in anomaly-based intrusion detection. While a true negative refers to an acceptable observation, a true positive identifies an anomalous observation, i.e., non-acceptable input. It should be noted that an anomaly is not necessarily an attack. An IDS is an approximation that could be incomplete or overly generous. An acceptable input wrongly classified as an anomaly is therefore a false positive or false alarm, and a false negative refers to non-acceptable inputs that are classified as acceptable by the IDS.


Figure 3.1: Anomaly detection errors in a binary classification setting with respect to the relevant sets of behaviors, states, or inputs (reproduced from [58, p. 357])

As an example, suppose a misuse-based IDS uses regular expressions, i.e., the class of regular languages (REG), to describe attack signatures and matches the signatures against observed content. If the expressiveness of the observed content is more powerful than the IDS technique, e.g., Turing-complete JavaScript, then there could be up to infinitely many ways to rewrite an attack so that no signature matches. This form of detection evasion is also referred to as a polymorphic attack [396].

The language-theoretic effects have been encountered by Hadžiosmanović et al. [172] in an experimental evaluation of anomaly detection techniques for networks, i.e., PAYL, POSEIDON, Anagram, and McPAD. These techniques learn from byte-based n-grams, but n-grams are equivalent to stochastic k-testable languages in the strict sense, and this language class can only model local and short-term language constraints strictly within the regular languages [425]. Furthermore, the assumption of a byte-based alphabet is not always sound. The evaluated scenarios are two binary protocols for Windows RPC and the industrial control system protocol Modbus. Both protocols contain length fields, which are in fact self-references that indicate high expressiveness [376]. The comparative analysis indicates that none of the detection techniques is capable of high detection rates and low false-alarm rates at the same time. The expressiveness gap between the actual language class and n-grams in the LangSec view would explain this problem.


Wressnegger et al. [488] also investigate n-grams with respect to anomaly detection and classification. Three criteria for audit data, i.e., perturbation, density, and variability, have been developed to decide whether an anomaly detection or classification approach is suitable for a certain application scenario. The criteria reflect language properties and the completeness of training data, and several showcase examples indicate that anomaly detection based on n-grams is unreliable when perturbation is high.

3.1.5 Intrusion Detection Evasion

An attacker does not want to be detected by an IDS, and three popular approaches to circumvent detection routines are: exploiting the computational complexity of preprocessing and analysis in a monitor, evading detection, and poisoning of training data in an anomaly-based IDS. If an attacker knows the detection algorithms in place, a high system load in the IDS can be provoked, e.g., Denial-of-Service by exploiting the exponential slowdown of some regular expression implementations [325]. First techniques to circumvent the observation of network-based attacks, especially for DPI monitors, have been presented by Ptacek and Newsham [346]. A network monitor needs to fully model network protocols; however, some characteristics require computation or memory and are ignored for the sake of efficiency. An attacker can exploit incomplete observations, e.g., by deliberate IP fragmentation, small-sized packets from misuse of TCP flow control, bogus packets with reduced IP Time-To-Live that falsify observations, overlapping TCP segments, provoking timeouts, misuse of the TCP Urgent Pointer, and content encodings. Some countermeasures are presented by Handley et al. [173], but Niemi et al. [287] summarize the state of the art in network evasion and show that many evasion techniques still work today. Another form of detection evasion is to avoid known attack patterns in misuse-based detection, and examples are polymorphic exploits [396]. Anomaly detection is conceptually capable of detecting polymorphic attacks; however, Wagner and Soto [477] introduce the notion of a mimicry attack that mimics inputs in such a way that the observation is classified as normal. For example, PAYL resorts to a histogram-based reference model over bytes in the content, and a mimicry attack called polymorphic blending [131] adds byte padding so that the overall byte distribution is considered normal. Anomaly detection estimates a reference model from training data, but if an attacker is able to modify or influence the training data, future attacks will not deviate from the trained model. Tampering with training data is referred to as poisoning [365]. A special poisoning attack for online learning techniques is described by Chan-Tin et al. [83] as the frog-boiling attack: the attacker sends explicitly permitted messages to continuously weaken the reference model until the true attack is considered acceptable. Poisoning and learning in the presence of an adversary have motivated research in adversarial machine learning [36, 187].

3.1.6 Kernel-Based Anomaly Detection in Tree-Structured Data

Elements in XML syntactically encode a tree structure, even when the semantic data model is a graph. Recent geometric approaches using kernel methods, in particular tree kernels, have shown promising results in detecting anomalies in tree-structured data [117, 153, 219, 355, 358, 359].


Tree Vector-Space Embedding. By turning an arbitrary structure, i.e., a message, from domain X into a vector, tools from geometry become available for analysis. The n-grams method is popular for embedding strings into a vector space of |Σ|^n dimensions by counting the occurrences of grams. However, n-grams cannot be applied to tree structures, and Rieck [355] therefore proposes vector-space embeddings for parse trees of network application layer protocol messages. Parse trees are ordered, labeled trees t, where x ∈ t are the nodes. A feature map φ : X → R^N for vector-space embedding translates a tree into an N-dimensional vector, and vector values are typically frequencies of particular nodes or subtrees in a tree. The considered nodes and subtrees for counting are defined in a so-called embedding set, and the following three embedding sets have been characterized [355]:

• Bag-of-nodes. Nodes in a tree are considered independent symbols, and the embedding set is therefore the set of labeled nodes. Figure 3.2b is an example for the tree from Figure 3.2a.

• Selected-subtrees. The embedding set is defined by a selection predicate over node labels that defines the roots of relevant subtrees for counting. The example in Figure 3.2c assumes a selection predicate that holds for symbols b and c.

• All-subtrees. If the definition of a selection predicate for restricting the embedding set is impossible, all possible subtrees need to be counted. However, the size of this embedding set can then be huge or infinite, and checking the presence of subtrees in the embedding set becomes computationally expensive. An example is in Figure 3.2d.

Figure 3.2: Vector space embeddings of a tree [355, 356]: (a) a tree, (b) bag-of-nodes, (c) selected-subtrees, (d) all-subtrees
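As an illustration of the bag-of-nodes embedding, the following Python sketch (my own; trees are represented as nested (label, children) tuples, which is an assumption for illustration and not the cited implementation) computes the node-label counts for the tree from Figure 3.2a.

from collections import Counter

def bag_of_nodes(tree) -> Counter:
    # A tree is a pair (label, children); count every node label.
    label, children = tree
    counts = Counter({label: 1})
    for child in children:
        counts += bag_of_nodes(child)
    return counts

# The tree a(a(b,c), b(c), c) from Figure 3.2a.
t = ("a", [("a", [("b", []), ("c", [])]), ("b", [("c", [])]), ("c", [])])
print(bag_of_nodes(t))  # counts: a=2, b=2, c=3, cf. Figure 3.2b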

Kernel Function. A kernel κ : X × X → R is a symmetric, real-valued function that defines a pairwise similarity measure of two elements from domain x, y ∈ X as the inner product κ(x,y) = ⟨ϕ(x), ϕ(y)⟩ in a feature space F, where ϕ : X → F maps domain elements into this feature space [383]. Many geometric relationships in the feature space can be expressed purely in terms of a kernel, e.g., norms, angles, and distances. The mapping ϕ and the feature map φ are clearly related (ϕ := φ is a valid feature map for a linear kernel); however, ϕ is more general. Any symmetric, positive semi-definite similarity measure κ : X × X → R is a kernel, even when the mapping ϕ is not explicitly specified [259, 355].

Kernel-based learning methods build hypotheses from geometric relationships in a feature space using a kernel function as the fundamental operation that can be computed efficiently. Furthermore, by choosing a nonlinear kernel function, where the inner product is calculated in a higher-dimensional feature space and ϕ is implicit, nonlinear relationships can be learned efficiently. The reader is directed to Schölkopf and Smola [383] for more information. For example, in SVMs for classification, a hyperplane in feature space represents the decision boundary between vector representations of samples. A Support Vector Data Description (SVDD) [405] is a special one-class SVM, where a hypersphere encloses the training examples in feature space, and anomalies lie outside the hypersphere. With respect to anomaly detection, kernels are proposed for global and local geometric detection approaches [355, 359].

Tree Kernel. A kernel can be specified for arbitrary structures (i.e., convolution kernels), and early applications to structured data were in parse tree analysis for natural language processing [96]. Intuitively, the number of shared subtrees between two trees is a valid inner product and therefore a similarity measure. However, a naive recursive counting algorithm leads to a runtime exponential in the number of tree nodes. Rieck [355] defines a generic tree kernel as

κ(s,t) = ⟨φ(s), φ(t)⟩ = ∑_{x∈s} ∑_{y∈t} c(x,y).

Function c is a counting function for a particular embedding, i.e., bag-of-nodes, selected-subtrees, or all-subtrees. To handle the exponential runtime for the selected-subtrees and all-subtrees embeddings, a dynamic programming algorithm is proposed [355]. Rieck et al. [356] furthermore introduce the notion of an approximate tree kernel for runtime improvement: to avoid counting all subtrees in the all-subtrees embedding set, an optimization routine precomputes an approximated selection predicate for a selected-subtrees embedding that reduces runtime while retaining as much expressiveness as possible.
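The following sketch instantiates the generic tree kernel for the bag-of-nodes embedding, where the counting function c(x,y) is simply an indicator for matching node labels; trees are again nested (label, children) tuples, an assumption for illustration rather than the implementation evaluated in the cited work.

def nodes(tree):
    # Yield all node labels of a (label, children) tree.
    label, children = tree
    yield label
    for child in children:
        yield from nodes(child)

def bag_of_nodes_kernel(s, t) -> int:
    # kappa(s, t) = sum over node pairs of c(x, y), with c(x, y) = 1 iff the labels match;
    # this equals the dot product of the two bag-of-nodes vectors.
    return sum(1 for x in nodes(s) for y in nodes(t) if x == y)

s = ("a", [("b", []), ("c", [])])
t = ("a", [("a", [("b", [])]), ("c", [])])
print(bag_of_nodes_kernel(s, t))  # 4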

Global Anomaly Detection. Rieck [355] considers anomaly detection as an unsupervised learning task using two functions: a learning function for calculating a model from normal examples, and a prediction function for assigning labels to samples based on the learned model. An intuitive model for global detection is a hypersphere that encloses the feature vectors of normal training data, and the assumption is that attacks reside outside the hypersphere. Figure 3.3a demonstrates the concept, and there are two approaches for hypersphere-based anomaly detection: center-of-mass and SVDD hypersphere.

Figure 3.3: Global and local anomaly detection in feature space [359]: (a) hypersphere in feature space, (b) neighborhoods in feature space


In the center-of-mass approach, the central vector µ is the average of all training examples. By “kernelizing” the learning and prediction functions, all geometric operations are expressed in terms of kernel operations, and an arbitrary kernel can be chosen. However, if feature map ϕ is implicit (e.g., in a nonlinear kernel), µ cannot be directly calculated, and computing the distance of a sample to the hypothetical center then requires a kernel computation with every example in the training data. The prediction function is then a threshold on the distance of a sample to the center. It should be noted that the hypersphere in the center-of-mass approach is not necessarily optimal. Krueger et al. [219] resort to a center-of-mass approach over byte content n-grams in the content anomaly detector in the TokDoc web application firewall. The SVDD approach considers the learning function as an optimization problem to find the center µ∗ of a volume-minimal hypersphere in feature space. This approach can also be rephrased in terms of kernel operations, so the center becomes implicitly defined by a set of weighted feature vectors in the sphere’s decision boundary, the so-called support vectors. The prediction function is again a threshold on the distance of a sample to the center of the hypersphere. If kernel mapping ϕ is explicit, µ∗ is also explicit, and the distance can be computed straightforwardly. If ϕ is implicit, the distance of a sample necessitates a kernel operation for every support vector. Düssel et al. [117] resort to an SVDD approach for anomaly detection in HTTP interaction.
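A minimal sketch of the kernelized center-of-mass score follows; it expresses the squared distance of a sample to the implicit mean of the training examples purely through kernel evaluations (the quadratic mean term can be precomputed once). This is my own illustration of the principle, not the TokDoc or SVDD implementations cited above.

def center_of_mass_score(kernel, train, z) -> float:
    # ||phi(z) - mu||^2 = k(z,z) - (2/n) sum_i k(z, x_i) + (1/n^2) sum_{i,j} k(x_i, x_j)
    n = len(train)
    self_term = kernel(z, z)
    cross_term = sum(kernel(z, x) for x in train) / n
    mean_term = sum(kernel(x, y) for x in train for y in train) / (n * n)
    return self_term - 2.0 * cross_term + mean_term

# Prediction is a threshold on the score, e.g., with the tree kernel sketched above:
# anomalous = center_of_mass_score(bag_of_nodes_kernel, training_trees, sample) > threshold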

Local Anomaly Detection. While nonlinear kernel functions enable nonlinear decision boundaries for hypersphere approaches, there is still a global perspective. To consider local perspectives of neighborhoods in feature space, Rieck [355] proposes kernelized variants of the k-nearest neighbors method, in particular, the Gamma and Zeta anomaly scores and prediction based on a score threshold. The Gamma anomaly score [174] of a sample is the mean distance to the k-nearest neighbors in feature space, however, this score is density-dependent and it could be impossible to set a good threshold. The Zeta anomaly score [357, 358] normalizes the Gamma score by the mean inner-clique distance. It should be noted that both Gamma and Zeta scores can be expressed purely in terms of kernel operations, and they do not require an explicit learning phase. However, prediction is computationally more intensive because the sample’s distances to all examples have to be computed using kernel functions. For finding an optimal threshold, a calibration based on cross validation is proposed. Rieck and Laskov [357, 358] and Rieck et al. [359] experimentally evaluate the Zeta anomaly score approach for intrusion detection in network traffic and Voice-over-IP infrastructures respectively.
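The Gamma score can likewise be phrased purely in terms of kernel evaluations via the kernel-induced distance; the sketch below is an illustration under that assumption (the Zeta score would additionally normalize by the mean inner-clique distance of the k neighbors).

from math import sqrt

def kernel_distance(kernel, x, z) -> float:
    # Distance in feature space, expressed through the kernel.
    return sqrt(max(kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z), 0.0))

def gamma_score(kernel, train, z, k: int = 5) -> float:
    # Mean distance of sample z to its k nearest training examples.
    dists = sorted(kernel_distance(kernel, x, z) for x in train)
    return sum(dists[:k]) / min(k, len(dists))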

Active Learning. Both global and local anomaly detection assume an unsupervised learning scenario and require attack-free training data. But attack-free training data is expensive compared to unlabeled data. Görnitz et al. [164, 165] rephrase anomaly-based intrusion detection as an active learning task to incorporate expert knowledge. The authors propose an active learning strategy, and the ActiveSVDD model considers both labeled and unlabeled examples for finding an optimal hypersphere. First, an initial hypersphere is computed from unlabeled examples. The active learning strategy then chooses examples that are both near the decision boundary and members of a small cluster of potential anomalies with respect to a k-nearest-neighbor graph. An expert is queried to provide labels for the chosen examples, and an improved hypersphere is computed. Several iterations of this procedure are typically necessary until a stable reference model is reached. The approach also supports online learning.


Kernel methods for tree-structured data seem promising; however, it is not clear how well the language properties of XML are captured.

3.1.7 XML Anomaly Detection

For anomaly detection in XML, Menahem et al. [257, 258] propose a feature extraction process to translate a training corpus of documents into an unlabeled dataset, so existing machine-learning algorithms can be applied. An XSD schema for the corpus is assumed to be available, and based on XSD element definitions, the feature extraction component maps structured XML documents to flattened fixed-length vectors. Some information is lost in the process, in particular structural properties. For anomaly detection, the authors then describe a one-class classification approach that estimates multiple univariate models from the translated dataset and calculates an aggregate normality score for testing a sample document. This approach focuses on text content, and the idea is not pursued further in this thesis because a schema is assumed, which is not always the case, and the lossy feature extraction violates language-theoretic properties.

3.2 The Extensible Markup Language

XML [451] is the lingua franca of the Internet and electronic data exchange, and its roots go back to SGML. The syntactic specification of XML covers:

• open- and close tags for element names

• element attribute names

• namespaces for element and attribute names

• allowed characters for text content and attribute values, and entities to denote special characters

• comments

• declaration of DTD element types, attribute lists, entities, and notations

• processing instructions for the parser, e.g., XML version and character encoding in the first line of the document

Syntactically, the tags form a tree structure, but an XML document is logically not necessarily a tree. An example is given in Figure 3.4. With respect to element and attribute content, the XML standard only distinguishes two datatypes for text content: parsed character data (PCDATA) as the default for text content and unparsed character data (CDATA) for attribute values and CDATA blocks. Furthermore, XML allows so-called mixed content for markup language characteristics, i.e., tags are allowed within text content, and the review element in Figure 3.4a is an example. A document is said to be well-matched if there is a single root element and open- and close-tags are correctly nested. Therefore, traversing a document from left to right (i.e., in document order) equals a depth-first traversal of the syntactic tree. The XML syntax still allows ambiguities, e.g., there are two ways of expressing an element without content. XML Information Set [442] defines a data model for XML to remove such ambiguities.


Figure 3.4: XML document and its tree representation: (a) XML document with attributes, data, and mixed content; (b) tree with data nodes

A document is said to have an infoset if it is well-formed and all namespace constraints are satisfied. An "infoset is a logical view of the XML document, rather than the document as stored in a text file" [426, ch. 2].

3.2.1 Namespaces

Qualified names help to distinguish elements that have the same name but different semantics. To uniquely qualify an element or attribute name in an XML document, a name is associated with a namespace URI, a human-readable string to identify a resource in the web [454]. Namespaces are declared in reserved attributes in the open tag of an element, and the scope of the namespace ranges until the close tag; however, nested elements can freely declare their own namespaces. The XML standard distinguishes default namespaces and prefixed namespaces, and the following example uses both of them:

<?xml version="1.0"?>
<movie xmlns="http://my.uri/movies"
       xmlns:director="http://my.uri/directors"
       xmlns:crossref="http://my.uri/crossreferences">
  <name release="en">2001:A Space Odyssey</name>
  <director:name crossref:nid="nm0040">Stanley Kubrick</director:name>
</movie>

The default element namespace is declared in the xmlns attribute, so the qualified name of the first child is the pair (http://my.uri/movies, name). The qualified name of the second child is (http://my.uri/directors, name) because of its prefixed namespace. Attributes are usually considered unqualified, i.e., an attribute is considered part of its element's context, and the default namespace does not apply. However, prefixes allow an attribute from a foreign namespace to be declared explicitly.

3.2.2 Parsing

The two most notable parsing models for XML are based on a tree representation in a DOM and a sequential representation of start-element, end-element, and characters events according to the Simple API for XML (SAX) [377]. While SAX implements event handling in callback functions which are pushed by the parser, the Streaming API for XML (StAX) [200] works the other way around and pulls a stream of events from a document. Virtual Token Descriptor (VTD) [492] is another parsing model for XML.

According to Lam et al. [225], an XML document is parsed in three steps, which are summarized in Figure 3.5. While character conversion and lexical analysis are unambiguously specified for all parsing models in the XML specification [451], syntactic analysis creates infoset items in the style of a chosen parsing model, e.g., tree-structured DOM nodes or SAX events. Inline DTD declarations are handled in the syntactic parsing step.
Infoset items are then made available to the business logic through the parsing model's API.

Figure 3.5: XML parsing steps: character conversion of the bit sequence, lexical analysis into a token sequence, syntactic analysis with schema validation producing infoset items (DOM nodes, SAX/StAX events), and semantic analysis in the business logic accessing them via the parsing model API

3.2.3 Schema Languages

XML is a language family because the standard only characterizes the syntax of documents, but the tree structure of elements and attributes is left unrestricted. A schema is in fact a grammar to characterize a set of XML documents. Schema languages typically operate at the infoset level [426], and there exist several schema languages for XML, e.g., DTD [451], XSD [443], Relax NG [292], and EDTD [223]. The simplest and least expressive schema language is a DTD.

Definition 1 (Document Type Definition (DTD) [254]). A DTD is a triple D = (Σ, d, e0), where Σ is a set of elements, production rules d : Σ → REG(Σ) map element names to regular expressions, and e0 is the distinguished start element. The right-hand side of a production rule is called a content model, and L(D) is the set of documents that satisfy d.

Example 1. The following DTD is satisfied by the document in Figure 3.4. Attributes are encoded as special elements, and production rules range over the elements Σ = {movie, @year, title, director, @nid, review, em}. For readability, production rules map to ε if not explicitly defined. With e0 = movie, the production rules d are defined as

movie ↦→ @year · title · director · review+,
director ↦→ @nid,
review ↦→ em*.

DTD defines the document structure by restricting an element's content model using a regular expression, and in XML, these production rules can be stated in a DOCTYPE declaration in the preamble of a document. Formally, DTDs correspond to the local tree languages in the definitions by Murata et al. [280]. While a conceptual DTD describes structure, an actual DOCTYPE declaration also necessitates datatypes, i.e., CDATA for attributes and PCDATA for text contents. DTD is simple but has limitations: context-dependent constraints on the children of an element cannot be expressed. The industry standard XSD introduces types of elements, so production rules are expressed over types instead of elements to increase expressiveness, and fine-grained datatypes for restricting text content are provided. EDTD, introduced by Papakonstantinou and Vianu [328], is a generalization of XSD with respect to tree structure, and the expressible language class is equivalent to the unranked regular tree languages [286].

Definition 2 (Extended DTD (EDTD) [328]). An EDTD is a tuple D = (Σ, M, d, m0, µ), where Σ is a set of elements, M is a set of types, µ : M → Σ assigns element names to types, and D′ = (M, d, m0) is a DTD over types. Function µ is a surjection, so an element name can denote different types depending on its context. A document w satisfies D if there is a typing w′ with w = µ(w′), where element names are mapped to types and w′ ∈ L(D′) holds. The type assignment w′ is called a witness for w, and L(D) denotes all the documents that satisfy D. An EDTD does not specify element and attribute contents.

Example 2. Refining Example 1 with types leads to an EDTD with start type m0 = Movie.
Again, for readability, production rules d map to ε by default, and the others are defined as

Movie ↦→ @MYear · MName · Director · Review+,
Director ↦→ @DirId · DName · Bio,
Review ↦→ @RYear · AName · RText,
RText ↦→ Emph*.

Mapping µ assigns element names to types and is defined as

Movie ↦→ movie,
@MYear, @RYear ↦→ @year,
MName, DName, AName ↦→ name,
Director ↦→ director,
Review ↦→ review,
@DirId ↦→ @nid,
Bio ↦→ biography,
RText ↦→ text,
Emph ↦→ em.

Schema Validation and Typing

Formally, a schema generates a language, i.e., a set of XML documents, and schema validation of a document becomes a MEMBERSHIP decision problem. Stream validation is then to decide membership in a single pass over the document. With respect to EDTDs, validation coincides with type checking. Typing is then to assign every element in a document a unique type from the production rules. For example, semantics in the business logic could be specified by assigning procedures to types, so a procedure is called when a type is encountered during parsing. However, EDTDs can be ambiguous because mapping µ is surjective, and two or more types could be assigned to the same element during typing. Ambiguity is a source of nondeterminism, but determinism is required for efficient schema validation and typing.

DTD and XSD resort to syntactic restrictions to guarantee deterministic content models. In particular, regular expressions in DTD production rules must be one-unambiguous regular expressions [73]. A regular expression is deterministic if, while traversing the input, every next symbol matches at most one symbol in the expression. For example, the expression (a + b)*a*c is clearly not deterministic because in the string ac the prefix a matches two different factors. XSD also has syntactic determinism restrictions in production rules, in particular the Element Declarations Consistent (EDC) and Unique Particle Attribution (UPA) constraints [254]. Informally, EDC forbids two elements that share the same name but have different types in the same content model, and UPA requires every regular expression over types to be one-unambiguous when types are replaced by their elements from mapping µ. Relax NG corresponds to regular tree grammars and allows schema ambiguity in general.

Schema Expressiveness

Syntactic restrictions in schema languages affect expressiveness and consequently schema composability, and the effects have been studied by Murata et al. [280] and Martens et al. [254]. First, Relax NG is as expressive as unrestricted EDTD. With respect to XSD, the EDC and UPA constraints guarantee that every type of an element is derivable from its ancestor elements only, and because of this single-type restriction, the expressible language class of XSD is EDTD^st. Due to this restriction, typing a document in streaming fashion is possible because an element's type is clear when the open tag is read. However, EDC and UPA are overly strict, and Martens et al. [254] characterize the 1-Pass Preorder Typing (1PPT) property of schemas as a more robust class capable of efficient validation and deterministic typing in a streaming scenario. This property is syntactically enforced in restrained-competition EDTDs (EDTD^rc) as defined by Murata et al. [280]. Informally, a production rule restrains competition if the type of an element is determined by its left siblings.
The relation between the schema languages in terms of expressiveness is therefore DTD ⊊ EDTD^st ⊊ EDTD^rc ⊊ EDTD.

Integrity Constraints

So far, the discussed expressiveness of schemas is concerned with the syntactic tree structure of infosets. But in reality, the practical schema languages DTD and XSD also enable integrity constraints that raise the expressiveness significantly. DTD, as defined in the XML standard, supports the declaration of keys (ID) and key references (IDREF, IDREFS) for attributes. References can be distinguished into sequential references, cyclic references, and mixtures thereof. The logical data model of an infoset with integrity constraints is therefore not a tree but a directed acyclic graph, or an infinite tree in case of cycles, and operations such as queries become computationally harder [489].

Furthermore, XSD introduces additional constraints (unique, key, and keyref) over text contents, attribute values, and combinations thereof. Checking integrity constraints during schema validation requires significantly more time and space for constructing indices or traversing the data model multiple times, or constraints are not properly checked at all; e.g., stream validation often requires that keys be declared before referencing them with respect to the document order [24]. It should be stressed that integrity constraints are explicitly defined in a schema, and syntactically, there is no difference between a regular attribute and a reference [451]. Without a schema, there is hardly any evidence in a document about the referential semantics of attributes, elements, or text contents. Attributes id and ref are regularly used for ID-based references in standards like XHTML [434], but it cannot be inferred that an attribute ref is universally an ID-based reference to a corresponding id attribute. Key mining in XML is a research field of its own [29].

3.2.4 XPath

The XML Path Language (XPath) [432, 456, 474] allows nodes in the logical data model of XML to be addressed using a path notation, and there are three versions of the language, continuously extended by functional capabilities for arithmetic and string operations. XPath has a slightly different data model [449] which is based on XML Information Set. It is furthermore the foundation for XPointer [436], document transformation in XQuery [457], and Extensible Stylesheet Language Transformations (XSLT) [450]. The syntax of XPath is designed to be compatible with URIs and PCDATA, so XPath expressions can be embedded as an HTTP query part or as text. An XPath expression evaluates either to a set of tree nodes, a Boolean truth value, a number, or a string. For example, the node-selecting expression /movie/director applied to the document in Figure 3.4a returns the singleton director element.

3.2.5 Syntactic Optimizations

The text-based document format has information redundancies, e.g., repeated element names in open- and close-tags, and there are several approaches to optimize the document representation. For text-based optimization, SXML [210] provides an alternative syntax for XML using more compact S-expressions instead of tags. Another approach is to store the document's infoset in a compact binary format, e.g., Efficient XML Interchange [470], .NET Binary XML [266], ASN.1-based Fast Infoset [196], and Binary MPEG Format for XML [195].
Transporting binary data in XML text content typically requires a binary-to-text encoding scheme like Base64 to satisfy the character constraints in element or attribute content. However, this leads to increased length, and another optimization is XML-binary Optimized Packaging (XOP) [452]. XOP is a multipart container format for XML based on the MIME type multipart/related, where the document is the first container part, and binary data are natively embedded as subsequent parts. Elements in the document then refer to these parts using the W3C MIME type extensions [444], and XOP is applied in MTOM for SOAP attachments.

3.3 XML Attacks

Many technologies for client-cloud interaction depend on XML as the language for content types and protocols: websites in XHTML or XML-compatible HTML5 [473] markup, selective AJAX updates in a web application, web syndication updates in RSS and Atom, call and return serialization in XML-RPC, all specifications for SOAP/WS-* web services, resource representation in RESTful services, and XMPP messaging. An XML document must therefore have an unambiguous interpretation under language-theoretic security principles. However, the processing of XML and its schema languages DTD and XSD can be vulnerable in the various steps of parsing. This section reviews the state of the art in XML attacks with respect to parsing and semantics in the business logic, and countermeasures are discussed. A special kind of vulnerability, XML signature wrapping [255], is discussed independently because of its relevance to this thesis. Notable surveys of attacks that focus on XML in SOAP/WS-* web services are given by Jensen et al. [202] and Morgan and Ibrahim [278]. Also, the WS-Attacks website [121] provides a comprehensive repository of web service attack descriptions, and the following summary is based on the stated resources.

3.3.1 Parsing Attacks

Oversized Tokens. A simple Denial-of-Service attack is placing long tokens in an XML document, e.g., large text contents and long attribute values; long element, attribute, and processing instruction names; many attributes in an element or processing instruction; and large comments. The attack aims for resource exhaustion during lexical and syntactic analysis, so processing slows down or fails. Also, a large number of elements can cause Denial-of-Service in a DOM parser that needs to allocate memory for every infoset item. Schema validation could mitigate oversized tokens, but a schema is still vulnerable to many repeating elements if repetitions are unbounded [202]. The following XML document is an example:

<?xml version="1.0" encoding="utf-8"?>
<movie veryLongAttrName1="large␣value" veryLongAttrName2="large␣value" ...>
<title>Very Long Text Very Long Text ...

Coercive Parsing. Another Denial-of-Service attack is coercive parsing [121]. Syntactic analysis typically resorts to a pushdown automaton for processing nested element tags, and a large amount of nested structures that also declare a large number of namespaces exhausts the pushdown automaton's stack and namespace data structure. While schema validation can fend off deeply nested structures early during parsing, neither the number of namespace declarations per element nor the length of namespace URIs is bounded in the XML standard [202]. An example looks as follows: ...
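The coercive-parsing listing itself is not reproduced above; as a hedged sketch of the event-based mitigation alluded to in this section, namely rejecting overly deep nesting before the whole document is processed, the following uses only the Python standard library and an arbitrarily chosen depth bound.

import io
import xml.etree.ElementTree as ET

def accept_depth(xml_bytes: bytes, max_depth: int = 100) -> bool:
    # Track nesting depth from start/end events and reject early when exceeded.
    depth = 0
    for event, _elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            depth += 1
            if depth > max_depth:
                return False
        else:
            depth -= 1
    return True

print(accept_depth(b"<a><b><c/></b></a>", max_depth=2))  # False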


Entity Expansion Attack. XML parser libraries typically support an inline document type declaration (i.e., DOCTYPE) for specifying a DTD grammar and entities in the preamble of an XML document. An entity is a macro in a document, and it can be declared internal or external. The entity expansion attack, also known as the "billion laughs" attack or XML bomb, expands exponentially many internal entities to use up memory and time [121]. Any implementation that supports inline DOCTYPE declarations is potentially vulnerable, and the best countermeasure is to disable inline declarations altogether. The following XML document is an example:

... ]> &x100;
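A hedged sketch of the recommended countermeasure, refusing inline DOCTYPE declarations before any entity expansion can take place, is shown below. It assumes the third-party lxml library; option names differ between XML libraries, so this is illustrative rather than prescriptive.

from lxml import etree

hardened_parser = etree.XMLParser(
    resolve_entities=False,  # never substitute DTD-declared entities
    no_network=True,         # never fetch external resources
    load_dtd=False,          # do not load inline or external DTDs
)

def parse_untrusted(data: bytes):
    doc = etree.fromstring(data, parser=hardened_parser)
    # Reject documents that carry an inline DOCTYPE at all.
    if doc.getroottree().docinfo.internalDTD is not None:
        raise ValueError("inline DOCTYPE declarations are not accepted")
    return doc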

External Entities. Another vulnerability that arises from inline DOCTYPE declarations is resource inclusion through external entities [121]. An external entity can refer to a URL, including local file systems, and syntactic analysis retrieves the resource during parsing if the feature is not explicitly disabled in the parser. Again, the best defense is to disable inline entity declarations. The following example is a message sent to an XML-processing service that updates a database. Instead of a name, the service retrieves the name value from a local file, stores it in the database, and the file contents are revealed when the attacker queries the database: ]> 1000 &file;

Server-Side Request Forgery. Depending on the parser implementation, various URI schemes are directly supported for external entity retrieval in the parser, including HTTP GET operations. A knowledgeable attacker can therefore invoke URLs, e.g., an HTTP-based attack, originating from the service while the attacker stays hidden from the victim. This attack is also referred to as server-side request forgery (SSRF), and instead of external entities, the following example uses the DOCTYPE declaration to call an internal web service [278]:

Morgan and Ibrahim [278] furthermore discuss an SSRF attack that exploits namespace declarations in XML elements. In an element, the attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation are hints for the parser where the schema for a certain namespace or the default namespace can be found, and an SSRF attack would be triggered. However, the parser needs to explicitly support the feature of automatic schema retrieval, and strict schema validation is a countermeasure by restricting attributes. An example looks as follows:


3.3.2 Semantic Attacks

Business logic accesses a document's infoset items by, e.g., handling a stream of SAX/StAX events or walking over tree nodes in a DOM. Moreover, XPath expressions can select a set of nodes in the tree representation. The fundamental idea of a semantic attack is to modify the infoset and its text contents, so the target's business logic accesses and processes violating data that leads to violating behavior by the program, its software components, or other contacted services.

CDATA Field Exploitation. A generalized approach to exclude character data from further parsing in a document is a CDATA field [428]. XML strictly specifies the allowed characters in text content; however, CDATA allows a much larger character set. Lexical analysis considers a CDATA block as a single token and only strips the CDATA brackets. This makes CDATA the foundation for many other attacks. CDATA allows XML fragments to be stored in text content without creating infoset items, and a potential application is XML injection, e.g., when an XML-based message is exchanged, parsed, and serialized by several services, so that at some location, a CDATA block could become valid XML. Strict schema validation in a service detects violating structures and text contents.

XML Injection. Consider a service where a user provides some value for an operation, and the service composes an XML message to call a second service, where the user-provided value and fixed parameters are arguments. The business logic of the second service then resorts to XPath expressions to select nodes from the message. If the user-provided value is unfiltered XML, an attacker could inject a fragment that changes the semantics of the message; e.g., the user-provided value in the following document creates additional infoset items when parsed by the second service:

value< p2>maliciousValue< p1>value fixedValue

In this case, the XPath expression //p2[1] returns the first node whose element name matches p2. The fixedValue is acceptable; however, the injected XML fragment matches the query before the intended element does. Injecting XML fragments into a document can affect XPath access and even affect state in SAX parsers if tree contexts are not handled correctly [121]. Schema validation is a countermeasure.

In this case, the XPath expression //p2[1] returns the first node, where the element name matches p2. The fixedValue is acceptable; however, the injected XML fragment matches the query before. Injecting XML fragments into a document can affect XPath access and even affect state in SAX parsers if tree contexts are not handled correctly [121]. Schema validation is a countermeasure.

Attacks in Text Content. The business logic often involves other software components and services, and an inconsistent datatype could lead to violating behavior. Examples are: path traversal in file systems, when text is interpreted as a file system path; injection of XPath, Structured Query Language (SQL), Lightweight Directory Access Protocol (LDAP), and command fragments; memory corruption to manipulate control or data flow in the business logic or one of its components; and cross-site request forgery (XSRF) and cross-site scripting (XSS) with respect to web applications [428, 206, 202]. The following document demonstrates a command injection. In the example, the business logic assembles a Unix shell command by concatenating an internal command template with the value of param, and a knowledgeable attacker can therefore inject arbitrary Unix commands:

7313

CDATA fields enable various injection attacks; e.g., the following document hides JavaScript code, encapsulated by a script tag, in text content. When the value is accessed by the business logic and stored as text in a database, the CDATA fields are stripped. If the database value is reused in a web application, an accessing web client executes this JavaScript code:

script]]> alert('This is JavaScript code') /script]]>

3.3.3 Schema Extensibility and Effects on Validation

Attacks via an inline DTD in a document can be prevented by disabling inline DTD support in the parser. Strict schema validation, including datatype checking for text content, can further mitigate parsing and semantic attacks because wrong interpretation by the business logic is minimized. XSD is the industry standard and designed to be extensible. There are two philosophies of code reuse and modularity: schema subtyping by reusing complex types, type extensions and restrictions, and substitution groups [242, 243]; and schema extension points for loose coupling based on xs:any. While schema subtyping shares similarities with classes in object-oriented programming, a schema extension point is similar to a wildcard for any kind of infoset item in a document. In the case of schema subtyping, validation of a document is unchanged; however, a schema extension point permits any nested element at the particular location in documents and therefore introduces a potential vulnerability.

The schema fragment in Figure 3.6 is from the SOAP specification [447] and demonstrates extension points. SOAP defines types for the envelope, body, and header of a SOAP message. Nested elements are not further specified; instead, xs:any elements are placed as extension points in the schema. According to the XSD specification, the processContents="lax" attribute has a tremendous effect on schema validation: "If the item has a uniquely determined declaration available, it must be -valid- with respect to that declaration, that is, -validate- if you can, don't worry if you can't" [443]. In other words, the uniquely determined declaration refers to an XSD schema associated with an element's namespace. This schema is either already present in the parser's environment or dynamically retrieved from a schema location. In case of an error, e.g., an unknown namespace or a failed schema retrieval, schema validation skips the element. As a consequence, a knowledgeable attacker can place arbitrary content in a document, even when schema validation is activated, if a schema has extension points with the processContents="lax" attribute.


Figure 3.6: Schema extension points in the SOAP XSD schema

SOAP and XMPP are two relevant XML-based protocols in today's client-cloud interaction technologies, and they rely on schemas that have extension points. Even XSD best practices refer to the any element wildcard for flexibility, arguing that it is better to err on the side of too many wildcards than too few [400]. Schema extension points can easily emerge by accident because in XSD, the default element type is anyType if not explicitly specified. A schema extension point renders a validating XML parser vulnerable to many parsing and semantic attacks. Furthermore, schemas are not always available for validation; e.g., RESTful web services and AJAX web applications support XML as a resource format but do not enforce the presence of a schema. In 2010, the general problem of ad hoc XML document design was studied by Grijzenhout and Marx [169]. To understand the quality of XML data in the web, the authors gathered a large XML corpus for analysis, where 85.4% of the documents are shown to be well-formed. Only 25% of the documents refer to a DTD or XSD schema, and only 8.9% of all documents are actually valid. The study, however, focuses on XML that can be retrieved from the web, including XHTML websites, and the results do not translate to SOAP/WS-* web services, where all protocols are specified in XSD.
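To make the role of schema validation as a MEMBERSHIP check concrete, the following sketch validates documents against a small hand-written schema without extension points, assuming the lxml library; the schema and documents are hypothetical. Replacing the title declaration with a lax xs:any wildcard would make the second document acceptable as well, which is exactly the weakness discussed above.

from lxml import etree

schema = etree.XMLSchema(etree.fromstring(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="movie">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""))

good = etree.fromstring(b"<movie><title>2001: A Space Odyssey</title></movie>")
bad = etree.fromstring(b"<movie><title>2001</title><injected/></movie>")
print(schema.validate(good), schema.validate(bad))  # True False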

3.3.4 XML Signature Wrapping Attack

The XML signature wrapping attack [255], also referred to as the XML rewriting attack, is a semantic attack. The fundamental problem is that signature verification is treated as an independent Boolean decision over a document, and the business logic is not prevented from reading an unchecked element. Moreover, the XML signature wrapping attack exploits schema extension points, so validating parsers cannot reject the attack.


XML Signature

XML Signature [453] is a W3C digital signature standard for integrity, message authentication, and signer authentication in XML. Figure 3.7 shows the informal XML structure of a signature according to the standard. A ds:Signature has a parameter structure to be signed (ds:SignedInfo), a signature value that verifies the parameter structure (ds:SignatureValue), optional information about the signer and his cryptographic keys (ds:KeyInfo), and eventually some referenced elements (ds:Object). The parameters of a signature define how to canonicalize XML (ds:CanonicalizationMethod) to get an unambiguous text representation for calculating the signature and hash values, the cryptographic method (ds:SignatureMethod), and the referenced elements (ds:Reference). Every ds:Reference locates an element using its URI attribute and has specific preprocessing operations (ds:Transforms), e.g., an XPath transform. The ds:DigestMethod defines the hashing algorithm to be applied to the transformed referenced element, and ds:DigestValue stores the hash value.

Figure 3.7: Informal grammar of XML Signature taken from the specification [453]

Calculating an XML Signature is a two-step algorithm. First, all referenced elements are transformed and canonicalized according to the specified method, and their digests are calculated. Canonicalization includes the use of a unified character encoding (i.e., UTF-8), removal and normalization of whitespace, removal of comments, and eventually the treatment of namespace declarations. The digests are then encoded in Base64 and stored at their designated positions within ds:SignedInfo. In the second step, the canonicalized representation of ds:SignedInfo is cryptographically signed, the signature is Base64 encoded, and stored as ds:SignatureValue. To check an XML Signature, the hash values of the references and the signature value need to be verified. There are two styles of locating a reference: ID-based references using the URI attribute (e.g., Figure 3.8a and 3.8b) and XPath-based referencing as a document transformation (e.g., Figure 3.8c and 3.9). An XPath transform expression is a filter, and depending on the algorithm version, the semantics are different. An XPath filter [453] is a Boolean expression, evaluated for every node, starting from the URI-referenced or root node. If the expression is true, the node is included in a set for digest calculation. However, filter semantics are different from node-selecting queries and harder to understand, so XPathFilter2 [437] has been proposed for better usability. An XPathFilter2 expression is treated as a node-selecting query starting from the URI-referenced node or document root, and the additional Filter attribute allows basic set operations, i.e., intersection, union, and difference, when multiple ds:XPath elements are defined.
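The following sketch illustrates only the data flow of the two-step algorithm: digests of canonicalized references feed into ds:SignedInfo, and the canonicalized ds:SignedInfo is then signed. Canonicalization is assumed to have happened already, and an HMAC stands in for the public-key SignatureMethod of the standard, so this is an illustration rather than a conforming implementation.

import base64
import hashlib
import hmac

def digest_value(canonical_reference: bytes) -> str:
    # Step 1: Base64-encoded digest of a transformed, canonicalized reference.
    return base64.b64encode(hashlib.sha256(canonical_reference).digest()).decode()

def signature_value(canonical_signed_info: bytes, key: bytes) -> str:
    # Step 2: signature over the canonicalized ds:SignedInfo structure
    # (HMAC used here as a stand-in for an RSA/DSA SignatureMethod).
    mac = hmac.new(key, canonical_signed_info, hashlib.sha256).digest()
    return base64.b64encode(mac).decode()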


Figure 3.8: The three types of XML Signature [453]: (a) detached signature, (b) enveloping signature, (c) enveloped signature

Figure 3.9: XPathFilter2 expression for referencing elements in XML Signature (a detached signature whose ds:Reference applies an XPathFilter2 transform with Filter="intersect" and the expression //data/nested)

Figure 3.8 summarizes the three XML Signature types: detached, enveloping, and enveloped signatures. The signature in the document in Figure 3.9 is also detached. WS-Security [295] is an example of a detached signature; the ds:Signature structure resides in the wsse:Security element in the message header, and a ds:Reference points to the signed data. As the signature is detached, it could also refer to arbitrary data outside the scope of the XML document. In an enveloping signature, the ds:Signature element becomes an envelope when the signed data is placed in a ds:Object element. An enveloped signature resides as a child of the element to be signed. The URI attribute is empty to denote the document root or refers to the parent element if the whole structure is nested, and a ds:Transforms declaration implicitly uses an XPath filter to exclude the ds:Signature from its digest calculation.

Signature Wrapping Attack Approaches

Signature wrapping was first discussed by McIntosh and Austel [255] for SOAP/WS-* web services, and in 2011, Somorovsky et al. [393] successfully applied signature wrapping to compromise SOAP/WS-* cloud management interfaces of public (Amazon EC2 and S3 services) and private (Eucalyptus) clouds. Figure 1.4 on page 7 demonstrates a classic attack, and the precondition for an attacker is to gather a message with a valid signature, e.g., from network sniffing. Attacks boil down to two strategies:

• exploiting ID-based references


• exploiting XPath- or XPathFilter2-based referencing

Furthermore, Somorovsky et al. [394] have demonstrated signature wrapping in SAML, a popular authentication and authorization protocol for single sign-on identity management in web applications and cloud computing. A digital signature guarantees authenticity and integrity of a SAML assertion, generated and signed by a trusted identity provider. According to the standard, the signature must be enveloped and should not contain any transforms other than the enveloped signature transform [304, p. 69]. Figure 3.10 shows examples of SAML assertions in a SOAP security header and in an explicit SAML protocol response, e.g., for RESTful web services. Signature wrapping allows an attacker to state arbitrary trusted assertions if a valid assertion has been captured, e.g., by network sniffing, and the recipient is vulnerable. In 2012, 11 out of 14 SAML frameworks and providers, including cloud services such as Salesforce, were vulnerable to this attack [394].

Figure 3.10: Examples for XML signatures in SAML assertions: (a) SAML over SOAP, (b) SAML protocol, e.g., for REST

Signature Wrapping Countermeasures

Security Policy. McIntosh and Austel [255] have also proposed a security policy for WS-Security as a countermeasure, in particular: the signature must reside in the message header, a trusted X.509 certificate has to provide the signature verification key, and two signature references must point to the IDs of the elements /soap:Envelope/soap:Body and /soap:Envelope/soap:Header/wsse:Security/wsu:Timestamp respectively. Also, Bhargavan et al. [53] propose a policy-driven approach by establishing a formal XML data model, predicates for validity conditions, and a verification procedure based on the applied π-calculus to show protocol properties. The results have influenced the development of a rule-based policy adviser [54] that evaluates a given WS-SecurityPolicy file for expressing security assertions on messages for a web service. The policy adviser analyzes for vulnerabilities, including the signature wrapping attack, and the policy rules must include: presence of mandatory elements in a message (soap:Body, wsa:To, wsa:Action); all mandatory elements, wsa:MessageID, and wsu:Timestamp are signed; and X.509 certificates for authentication are recommended. However, Gruschka and Iacono [171] demonstrate a successful signature wrapping attack on the Amazon EC2 cloud service that satisfies the mentioned security policy approaches.


Inline Approach. Rahaman et al. [349] acknowledge that a signature wrapping attack modifies the structure of a message, and they propose an inline approach, i.e., a distinguished SoapAccount element in the wsse:Security header of a message that captures characteristics of the message itself and is also signed. These characteristics include: the number of child elements in soap:Header and soap:Body, the number of signature references, and the predecessor and successor element names of referenced elements in the signature. An attack is assumed to change these characteristics and could therefore be detected; however, this approach is not standardized. Gajek et al. [143] identify weaknesses in the inline approach if there is a single unsigned element in the header, and the authors were able to construct realistic attacks that retain the SoapAccount characteristics.

Improved Signature Verification. Gajek et al. [143] also propose that signature verification should not be treated as a Boolean decision over a document but rather as a filter. The node sets returned to the business logic are then from the canonicalized XML text of verified references; however, this filter is only applicable if unsigned message parts are not required for processing. The authors discuss approaches, e.g., returning a spanning DOM tree over signed parts and their parents or returning precise locations of verified elements in terms of XPath expressions. Similarly, Somorovsky et al. [394] propose a filtered view on verified elements or marking the locations of verified elements for the business logic as a countermeasure against signature wrapping in SAML. In follow-up research, Gajek et al. [142] propose XPath expressions in signature references instead of location-unaware ID-based references. XPath-based referencing can still be vulnerable, and the authors therefore introduce a subset called FastXPath, where wildcard axes are forbidden and node position predicates are mandatory. However, by injecting namespaces and redeclaring namespace prefixes, Jensen et al. [203] exploit the canonicalization process and demonstrate that signature wrapping is still a threat when namespaces are not explicitly specified in XPath expressions.

Schema Hardening. To understand the effectiveness of schema validation as a countermeasure against signature wrapping attacks, Jensen et al. [204] introduce schema hardening to generate a unified XSD schema for a web service. In particular, the following weaknesses in XSD schemas are identified for removal: xs:any wildcards, processContents="lax" attributes that instruct the parser to skip validation if no schema is available, and namespace="##any" or namespace="##other" attributes that allow arbitrary namespaces. For an attacker, it therefore becomes impossible to move the originally signed content to an unchecked location in the document because schema validation using the hardened schema eliminates all extension points. However, all required WS-* schemas need to be merged into a single XSD schema file, which is computationally challenging because of combinatorial effects, and the authors have reported a dramatic increase in validation time using a hardened schema. Mainka et al. [252] introduce XSpRES (XML Spoofing Resistant Electronic Signature) as a standard-compliant and comprehensive framework that unifies several XML attack countermeasures, including schema validation. XSpRES defines client- and service-side modules for signature reference verification, namespace prefix transformation for unambiguous FastXPath referencing in signature creation, Denial-of-Service protection, and schema hardening for validation. However, only a few WS-* schemas are considered for schema hardening because of complexity problems.


3.3.5 Language-Theoretic View on XML Attacks

In the LangSec philosophy, many security vulnerabilities are caused by ambiguous or overly expressive input languages, and this is also the case for XML. Basically, the attacks demonstrate weaknesses of the XML language specification that originate from missing restrictions or from SGML legacy, such as the semantics of inline DOCTYPE declarations and entities. A very first countermeasure is to completely disable inline DOCTYPE support in XML processors [278]. However, this is not always possible; e.g., XHTML requires an external entity declaration to resolve HTML entities. The signature wrapping attack and the various unsuccessful countermeasures also demonstrate a more fundamental problem. While the XML syntax is type-2 in the Chomsky hierarchy (context-free languages) and infoset items are tree-structured, the logical data model of a document is not always a tree because of integrity constraints. XML Signature relies on references, and together with schema extension points in most protocol standards, a resilient implementation turns out to be difficult [393, 394]. An implementation that utilizes XML Signature must make sure that the business logic processes exactly the elements that are actually hashed in the signature.

3.4 Stream Validation

Schema validation has been proposed as a countermeasure for various parsing and semantic attacks in XML; however, schema extension points in many of today's standards allow arbitrary nested content, and an attacker can often bypass validation. Schema hardening, as studied by Jensen et al. [204], is a potential solution that removes extension points, but the slowdown in hardened schema validation is a considerable drawback. The authors state that the effectiveness of countermeasures necessitates event-based XML processing (i.e., SAX or StAX), so invalid messages can be rejected at an early stage. Protocols such as XMPP establish an open-ended XML stream for exchanging well-formed XML stanzas. A validation mechanism therefore needs to be capable of stream processing, and first results on XML stream validation have been presented by Segoufin and Vianu [386]. Assuming constrained memory, the authors propose a finite-state machine over open- and close-tag events for strong validation of nonrecursive DTDs. For recursive DTDs, the memory constraints are relaxed to allow a stack bounded by the depth of the document to be validated. However, pushdown automata are not appealing as they do not define a robust class of languages: they are not closed under complement, inclusion is undecidable, etc. [223].

3.4.1 Visibly Pushdown Automata

Kumar et al. [223] consider the language class of XML streams of open- and close-tag events as visibly pushdown languages (VPLs), a subset of the deterministic context-free languages (DCFLs), and the authors propose XML visibly pushdown automata (XVPAs) as a more suitable language representation. VPLs have a history in program analysis, and a VPL has a pushdown alphabet that is composed of three disjoint alphabets: Σcall, Σret, and Σint. The respective names of these alphabets originate from the modeling of programs, where procedure calls and returns are distinguished from internal actions.


Definition 3 (Visibly pushdown automaton (VPA) [10, 223]). A VPA A accepts a VPL over alphabet Σ = Σcall ⊎ Σint ⊎ Σret and is defined as A = (Q, q0, QF, Γ, δ), where Q is a finite set of states, q0 ∈ Q is a distinguished start state, QF ⊆ Q are the final states, Γ is a stack alphabet that contains a special symbol ⊥ ∈ Γ for indicating an empty stack, and the transition relation is δ = δ^call ⊎ δ^int ⊎ δ^ret:

• δ^call ⊆ Q × Σcall × Γ × Q

• δ^int ⊆ Q × Σint × Q

• δ^ret ⊆ Q × Σret × Γ × Q

Semantics. The alphabet in a VPA is partitioned into the disjoint sets $\Sigma_{call}$, $\Sigma_{int}$, and $\Sigma_{ret}$, and stack-affecting inputs become visible. A call transition $(q, c, \gamma, q') \in \delta^{call}$ on input $c \in \Sigma_{call}$ pushes $\gamma$ on the stack and moves from state $q$ to $q'$, a return transition $(q, c, \gamma, q') \in \delta^{ret}$ on input $c \in \Sigma_{ret}$ pops $\gamma$ from the stack and moves from state $q$ to $q'$, and an internal transition $(q, a, q') \in \delta^{int}$ on input $a \in \Sigma_{int}$ leaves the stack unchanged and moves from state $q$ to $q'$. The total VPA state in a run is captured in a configuration $Q \times \Sigma^* \times \Gamma^*$, and a VPA therefore characterizes a relation $T$ over configurations. The start configuration is $(q_0, w, \bot)$, and $(q, \varepsilon, \bot)$ for some $q \in Q_F$ indicates a final configuration. The VPA transition relation $\delta$ induces configuration transitions in $T$, i.e., $((q, cw, v), (q', w, v\gamma))$ for a call transition, $((q, cw, v\gamma), (q', w, v))$ for a return transition, and $((q, aw, v), (q', w, v))$ for an internal transition. A run on input $w$ is then a sequence of configurations $\sigma_0, \dots, \sigma_n$ such that every $(\sigma_{i-1}, \sigma_i) \in T$, and a run is accepted if $\sigma_n$ is a final configuration. Furthermore, a run over string $w$ is denoted as $(q_0, \bot) \xrightarrow{w}_A (q, v)$.

VPAs have appealing properties: they capture the unranked regular tree languages, are closed under complement and set operations, and every VPA can be determinized [10, 223]. Intuitively, VPAs capture languages where the stack operation is visible from the input symbol.
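To make these configuration semantics concrete, the following sketch runs a small deterministic VPA; it is illustrative only, and the state names, the bottom-of-stack marker, and the example language (a single element with optional text) are assumptions of this sketch rather than part of the formal definition.

```python
# Minimal sketch of a deterministic VPA run (illustrative, not the thesis code).
# Calls push a stack symbol, returns pop one, internals leave the stack alone.

BOTTOM = "#"  # stands in for the empty-stack symbol

def run_vpa(word, calls, rets, ints, delta_call, delta_ret, delta_int, q0, finals):
    """Return True if the VPA accepts the list of input symbols `word`."""
    state, stack = q0, [BOTTOM]
    for sym in word:
        if sym in calls:
            target = delta_call.get((state, sym))
            if target is None:
                return False
            state, gamma = target
            stack.append(gamma)                      # push
        elif sym in rets:
            gamma = stack.pop() if stack[-1] != BOTTOM else BOTTOM
            state = delta_ret.get((state, sym, gamma))
        elif sym in ints:
            state = delta_int.get((state, sym))
        else:
            return False
        if state is None:
            return False
    return state in finals and stack[-1] == BOTTOM

# Example language: <a> [text]? </a> as an event word over a pushdown alphabet.
calls, rets, ints = {"<a>"}, {"</a>"}, {"[text]"}
delta_call = {("q0", "<a>"): ("q1", "A")}
delta_ret = {("q1", "</a>", "A"): "qf"}
delta_int = {("q1", "[text]"): "q1"}
print(run_vpa(["<a>", "[text]", "</a>"], calls, rets, ints,
              delta_call, delta_ret, delta_int, "q0", {"qf"}))  # True
print(run_vpa(["<a>", "</a>", "</a>"], calls, rets, ints,
              delta_call, delta_ret, delta_int, "q0", {"qf"}))  # False
```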

3.4.2 XML Visibly Pushdown Automata

XML is a stricter language class because of the well-matched property. Modular VPAs [9] are a subclass of VPAs for well-matched VPLs with interesting properties, and applications include the modeling of boolean programs; XVPAs have been defined as a special case of modular VPAs. This subsection recalls the definitions by Kumar et al. [223], which focus only on structural XML aspects, and data is ignored.

Definition 4 (Modular VPA [223]). Let $\Sigma$ be the set of element names and $M$ a set of modules that represent content models. The surjective mapping $\mu : M \rightarrow \Sigma$ assigns elements to modules. A modular VPA $A$ over $(\Sigma, M, \mu)$ is then a tuple $A = (\{Q_m, e_m, \delta_m\}_{m \in M}, m_0, F)$, where for every module $m$:

• $Q_m$ is a finite set of module states

• $e_m \in Q_m$ is the module's entry state

• $\delta_m = \delta_m^{call} \uplus \delta_m^{ret}$, where:

  – $\delta_m^{call} \subseteq \{ q_m \xrightarrow{c/q_m} e_n \mid n \in \mu^{-1}(c) \text{ and } c \text{ is an open-tag event} \}$

  – $\delta_m^{ret} \subseteq \{ q_m \xrightarrow{c/p_n} q_n \mid n \in \mu^{-1}(c) \text{ and } c \text{ is a close-tag event} \}$, and the relation is deterministic, i.e., $q_n = q_n'$ whenever $q_m \xrightarrow{c/p_n} q_n$ and $q_m \xrightarrow{c/p_n} q_n'$

Module $m_0 \in M$ is the designated start module, and $F \subseteq Q_{m_0}$ are the final states of the automaton. If $\mu$ is a bijection between $\Sigma$ and $M$, every element has an individual module, and the modular VPA is referred to as a single-entry VPA (SEVPA). Return transitions are deterministic by definition, and a modular VPA is said to be deterministic if also the call transitions are deterministic.

Semantics. Modular VPA semantics are defined by its corresponding VPA. If $A$ is a modular VPA over $(\Sigma, M, \mu)$, then $A' = (Q, q_0, \{q_f\}, \Gamma, \delta)$ is a VPA over events $(\Sigma \uplus \overline{\Sigma})$, where:

• $Q = \{q_0, q_f\} \cup \bigcup_{m \in M} Q_m$

• $\Gamma = Q$

• $\delta = (\bigcup_{m \in M} \delta_m) \cup \{ q_0 \xrightarrow{\mu(m_0)/q_0} e_{m_0} \} \cup \{ q \xrightarrow{\overline{\mu(m_0)}/q_0} q_f \mid q \in F \}$

The accepted language of module $m$ in modular VPA $A$ is of the form $L(m) = \mu(m) \cdot w \cdot \overline{\mu(m)}$. Consequently, the accepted language of modular VPA $A$ is $L(A) = L_A(m_0) = L(A')$, where $A'$ is the associated VPA.

Intuitively, a type in an EDTD is equivalent to a VPA module, and a module should therefore accept one language. But a single module could accept many well-matched languages based on the current stack and by taking different return transitions, and this behavior violates the regularity of EDTD production rules. A single module needs to accept a single language, and a notion of exit is therefore defined:

Definition 5 (Module exit [223]). Suppose $A = (\{Q_m, e_m, \delta_m\}_{m \in M}, m_0, F)$ is a modular VPA over $(\Sigma, M, \mu)$. The non-empty set $X_m \subseteq Q_m$ is a $(p_n, q_n)$-exit for module $m$, or $X_m(p_n, q_n)$ for short, if $q_m \in X_m \iff q_m \xrightarrow{c/p_n} q_n$ holds. There can be at most one $(p_n, q_n)$-exit in a module. In other words, all states in $X_m(p_n, q_n)$ return to the same $q_n$ when $p_n$ is on top of the stack.

Definition 6 (XML visibly pushdown automaton (XVPA) [223]). An XVPA $A$ over $(\Sigma, M, \mu)$ is a tuple $A = (\{Q_m, e_m, \delta_m, X_m\}_{m \in M}, m_0, F)$ such that $(\{Q_m, e_m, \delta_m\}_{m \in M}, m_0, F)$ is a modular VPA, and $X_m$ is a unique exit for every module $m \in M$. This single-exit property guarantees that all states where return transitions originate behave the same, and every module characterizes the same language independent of where it is called.

Kumar et al. [223] have shown useful properties of XVPAs. Every EDTD can be translated into an equivalent XVPA, and for an extended version of XVPAs, this equivalence is shown in Chapter 5. A schema is said to be pre-order typed (i.e., it satisfies the 1PPT property) if the schema-equivalent XVPA is deterministic. Furthermore, every XVPA has a deterministic VPA which is capable of deciding MEMBERSHIP of a document in linear time and space bounded by the depth of nested elements [223]. For modular VPAs, Kumar et al. [221, 222] also discuss minimization results and query learning, and VPAs are therefore considered a suitable XML language representation for this thesis.


Example 3. Let $D = (\Sigma, M, d, m_0, \mu)$ be an EDTD over elements $\Sigma = \{\text{ord}, \text{itm}, \text{det}\}$, where types $M = \{\text{Order}, \text{Item}, \text{Details}\}$, mapping $\mu = \{\text{Order} \mapsto \text{ord}, \text{Item} \mapsto \text{itm}, \text{Details} \mapsto \text{det}\}$, start type $m_0 = \text{Order}$, and $d(\text{Order}) = \text{Item}^+ \cdot \text{Details}$. The corresponding XVPA is shown in Figure 3.11. Modules are represented as boxes, and exits are denoted as accepting states. To ease understanding, the implicit start and final states $\{q_0, q_f\}$ of the corresponding VPA are added. Note that text content is ignored.

[Figure 3.11(a) depicts the XVPA with its modules Order, Item, and Details and the implicit states q0 and qf; Figure 3.11(b) lists an accepted document event stream: startElement ord, startElement itm, endElement itm, startElement itm, endElement itm, startElement det, endElement det, endElement ord.]

Figure 3.11: An exemplary XVPA and an accepted document event stream

3.4.3 VPAs in XML Research

The VPA approach to schema validation has motivated further research. For approximate validation of a document, where up to k errors are considered tolerable, Thomo et al. [412] extend the VPA approach and introduce a visibly pushdown transducer (VPT) framework and algorithms, so the tolerable errors reflect only in the required space, which depends on the schema and k. Approximate schema validation for a document is then decided in a single pass. Schewe et al. [381] model VPAs for schema validation by Abstract State Machines (ASMs) [64], elaborate on updates for documents and schemas by introducing additional rules to the ASM model, and also consider approximate validation up to k errors. Picalausa et al. [340] present a schema framework for streaming validation and schema operations, e.g., equivalence and inclusion checking, by resorting to VPAs as a unified representation. As an alternative to the XVPA model, Gauwin et al. [151] introduce the equivalent notion of streaming tree automata for stream processing of XML.

3.5 Schema Inference

Learning a language representation is related to the well-known problem of finding a schema that explains a corpus of XML documents. Related work on learning a schema from XML document examples focuses in particular on document structure, and approaches can be distinguished into DTD and XSD inference.

3.5.1 DTD Inference

Because of their limited expressiveness in comparison to other schema languages, learning DTDs is in fact learning regular expressions over element names. However, Gold [155] has already proven that learning from positive examples only is impossible for the unrestricted class of REG. A common approach is therefore to find language class restrictions where learnability becomes achievable. For learning regular languages, a generic method of generalization is to merge states in a special DFA called a prefix tree acceptor (PTA).

Definition 7 (Prefix tree acceptor (PTA) [178]). Let $\Sigma$ be an alphabet, $S^+ \subseteq \Sigma^*$ be a set of positive examples from a regular language, and $Pref(S^+)$ be the set of all prefixes of examples. $PTA(S^+)$ is then a DFA $A = (Q, \Sigma, q_0, \delta, F)$, where $Q = Pref(S^+)$, $q_0 = \varepsilon$, $F = S^+$, and $\delta = \{ (w, a) \mapsto wa \mid a \in \Sigma \text{ and } w, wa \in Pref(S^+) \}$. States are named by example prefixes, the automaton has a tree structure, and it accepts exactly the examples, i.e., $L(PTA(S^+)) = S^+$.
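Building a PTA is mechanical; the sketch below is an illustrative rendering of Definition 7 over plain strings (the example set matches the one used for Figure 3.12), not the learner implemented in this thesis.

```python
# Sketch: build a prefix tree acceptor (PTA) from positive examples.
# States are the prefixes of the examples; final states are the examples.

def build_pta(examples):
    prefixes = {""}                          # the empty prefix is the start state
    for w in examples:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])
    # every non-empty prefix wa induces the transition (w, a) -> wa
    delta = {(p[:-1], p[-1]): p for p in prefixes if p}
    return {"start": "", "finals": set(examples), "delta": delta}

def accepts(pta, w):
    state = pta["start"]
    for a in w:
        state = pta["delta"].get((state, a))
        if state is None:
            return False
    return state in pta["finals"]

pta = build_pta({"ac", "abbcd", "abc", "acddd", "abbbcd"})
print(accepts(pta, "abc"))    # True: exactly the examples are accepted
print(accepts(pta, "abcd"))   # False: no generalization before state merging
```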

DTD inference goes back to Ahonen's work on SGML [6]. Her methodology first constructs context-free grammar production rules from positive examples, where the left-hand sides are SGML tags, and the right-hand sides are sequences of SGML tags that have been observed in the context of the respective left-hand side tag. In a second step, the right-hand sides of rules with an equal left-hand side are represented as a single PTA, and various generalization methods that merge states from learnable subclasses of REG are discussed, i.e., k-contextual [144], k-reversible [15], and (k,h)-contextual regular languages. The third step refines and disambiguates the grammar, embodied in the production rules, and constructs one-unambiguous regular expressions [73] for the right-hand sides. Fernau [124, 125] discusses DTD inference in a similar setting to Ahonen's work and introduces the generalized class of function-distinguishable regular languages, where identification from positive examples is feasible. In follow-up work, Fernau [126] characterizes the language class of simple looping regular expressions, where one-unambiguous regular expressions can be identified from positive examples. Garofalakis et al. [148] introduce the XTRACT framework for inference of DTDs from a set of example documents. To find regular expressions that are intuitive for a human operator, they propose heuristics to generalize observed sequences of elements into regular expressions, factoring of common subexpressions in candidate DTDs, and composing a near-optimal DTD from candidate DTDs guided by the minimum description length (MDL) principle. Sankey and Wong [374] approach the problem of DTD inference from a corpus of examples by known inference methods. In particular, the authors propose a weighted PTA in the spirit of Ahonen's work [6] and generalization by merging states and updating the probabilities. The transitions and final states of the PTA are annotated by counting from the examples, so a probabilistic finite-state automaton can be derived, and an example is shown in Figure 3.12. Furthermore, the authors propose the minimum message length (MML) as a measure to rate the quality of an inferred DTD. The proposed learning method combines the sk-strings criterion for state merging, i.e., a stochastic relaxation of the Nerode equivalence relation, and ant colony optimization to find a DTD with minimal MML. Bex et al. [49] realize that most regular expressions in DTDs are actually simple, and they identify two subclasses of deterministic regular expressions: single-occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). In a SORE, each symbol occurs at most once. For learning a SORE, a PTA is constructed and states are merged; the particular method is discussed in greater detail in Section 5.4.1.


Figure 3.12: PTA constructed from examples {ac, abbcd, abc, acddd, abbbcd} and annotated by frequencies

A CHARE is a special case of a SORE that can be inferred directly, without an automaton representation in between. Freydenberger and Kötzing [139] present more efficient learning algorithms for SOREs and CHAREs. Bex et al. [48] also characterize the class of i-occurrence regular expressions (i-OREs), another subclass of deterministic regular expressions and a superclass of SOREs, where each symbol occurs at most i times in an expression. A probabilistic algorithm for learning from examples is presented, where an i-ORE is learned for increasing values of i, and the best i-ORE according to the MDL principle is returned.

3.5.2 XSD Inference

Due to types in XSD, and generalized in EDTD, schema inference is more complex, and the contexts of elements need to be considered. For example, a popular tool for DTD and XSD inference and for the translation of schemas between schema languages is Trang [92], but the tool does not consider element contexts and types and is therefore restricted to the inference of DTD-expressible languages. Chidlovskii [88] presents an XSD inference approach that adopts an extended context-free grammar (ECFG) formalism as schema representation. Training documents are considered as structured examples of an unknown ECFG. The inference algorithm first creates the set of terminals from element names and a set of nonterminals as types. For a production rule, an element e in a document is encoded as e : t on the right-hand side, where t is either a unique nonterminal (i.e., type) or t = Any if e is a leaf element. Nonterminals with similar context and right-hand sides in production rules are repeatedly merged according to induction rules that also exploit the determinism requirements for XSD. Tight datatypes instead of Any are determined based on the lexical subsumption relationship on elementary XSD datatypes as shown in Figure 3.13a. Finally, right-hand sides of production rules are translated into ranged regular expressions, so an XSD can be generated. Hegewald et al. [176] introduce the XStruct framework for XSD extraction from an XML document corpus based on ideas from Min et al. [271]. XStruct has several modules to generate an XSD file from a set of documents, and the most relevant are: the content model extraction module with a subsequent factoring module, the attribute extraction module, and the datatype recognition module. The content model is built upon the one-unambiguity determinism constraint in DTD and XSD, and the learnable language class is actually restricted to DTD because types are not considered. For every element e, XStruct learns a content model $e \rightarrow (T_1 \dots T_k)^{\langle min,max \rangle}$, where a term $T_n$ is either a sequence term $T_n \rightarrow (s_{n1}^{opt} \dots s_{nj}^{opt})^{\langle min,max \rangle}$ or a choice term $T_n \rightarrow (s_{n1}^{opt} \mid \dots \mid s_{nj}^{opt})^{\langle min,max \rangle}$ over nested elements $s_n$, which are mandatory or optional ($opt = true$). Repetition of a term is quantified by min and max.

[Figure 3.13: Lexical-subsumption datatype inference reproduced from related work — (a) the XSD-based datatype hierarchy of Chidlovskii [89] rooted in string, and (b) the primitive datatypes String, Boolean, Date, Double, Time, Decimal, and Integer of Hegewald et al. [176].]

The expressiveness is rather limited; e.g., the regular expression $ab(c|d^*)$ over nested elements cannot be expressed, and an approximation would be $e \rightarrow (T_1 T_2)$, where $T_1 \rightarrow (ab)$ and $T_2 \rightarrow (c|d)^{\langle 0,\infty \rangle}$. The framework creates sequence-term content models for all elements observed in example documents, and a factoring algorithm [271] generalizes the terms for every element by, e.g., introducing choices and adding counting constraints. A simple datatype system, as shown in Figure 3.13b, derives primitive datatypes for text content, also based on lexical subsumption. Attributes are treated independently from content models. Bex et al. [51] propose learning algorithms for the XSD language class, where the context of elements is considered to derive types. Because of the single-type restrictions in XSD, types can be derived from ancestor element paths, and local content models are learned by using the SORE-learning approach from DTD inference. The authors characterize the language class of k-local single-occurrence XSDs (SOXSDs), where up to k ancestor elements characterize a type, and most XSDs fall into this class [254]. A converging learning algorithm for this class is presented, and a second algorithm introduces a similarity heuristic to smoothen types in an inferred schema, so better generalization on small and incomplete training corpora in practical scenarios can be achieved. The algorithms are implemented in the SchemaScope [52] framework. Mlýnková and Nečaský [273, 274] discuss heuristic methods and consider schema inference as a combinatorial optimization problem solved in seven steps. Following the formalisms introduced by Murata et al. [280], XML documents are considered as directed labeled trees, and schemas are generalized by regular tree grammars, where production rules define nonterminals by right-hand side regular expressions, similar to the work of Chidlovskii [88, 89]. In a first step, an initial grammar is derived such that for any node n in the example documents, there is a production rule $n \rightarrow lab(n)(n_1, \dots, n_k)$, where $lab(n)$ is the element name and $n_1, \dots, n_k$ are the child nodes. The second step is to cluster production rules with similar content models and similar contexts of elements, e.g., by edit distances. The third step is to translate content models into regular expressions. The authors propose a state-merging approach that starts with a PTA representation of the content models in the same cluster and applies heuristic rules to merge states.

The fourth step refactorizes the regular expressions to simplify them, the fifth step refers to datatype inference of text and attribute content, the sixth step derives evident integrity constraints, e.g., ID-based references, and the last step translates the tree grammar representation into the target syntax, e.g., XSD or Relax NG. With respect to datatype inference, the authors refer to the solutions by Chidlovskii [88] and Hegewald et al. [176].

3.6 Wrapper Inference for Information Extraction

In searching for information in a collection of XML documents or web sites, two problems arise: selecting the relevant documents (i.e., information retrieval) and extracting information from some document. Approaches that learn a wrapper have been proposed to automate the extraction process by learning from annotated examples what needs to be extracted. In particular, Kosala et al. [215] present a learning algorithm that infers a k-testable ranked tree automaton from a set of annotated example documents. However, XML documents and web sites are unranked, and during preprocessing, trees are encoded as ranked binary trees. Informally, a tree language is k-testable if MEMBERSHIP can be decided by looking only at subtrees of height k. The authors consider the k-testable language class as a sufficient approximation for learning a wrapper tree automaton, given that k is chosen correctly based on experimental cross validation. Two additional learning algorithms are presented for optimized application as an information extraction wrapper. Raeymaekers et al. [348] introduce the notion of (k,l)-contextual tree languages for unranked trees to overcome the tree binarization step in preprocessing. Again, a tree automaton is inferred from marked trees, so information can be automatically extracted from documents or web sites. A tree language is (k,l)-contextual if MEMBERSHIP of an unranked tree can be decided by looking only at its tree fragments, i.e., (k,l)-forks of width k and height l. In terms of learning, a (k,l)-contextual tree language can be learned from examples by simply collecting all allowed (k,l)-forks. The authors propose an algorithm for learning a wrapper and a method to infer the parameters k and l from a few counterexamples instead of cross validation.


Chapter 4

Language-Based Anomaly Detection

To sum up the results of the previous chapters, XML and XML-based protocols are the foundation of many cloud services, but their implementations can be vulnerable to various attacks. In particular, the signature wrapping attack has been resilient against countermeasures because extension points are present in practically all schemas for XML-based protocols, and integrity constraints (i.e., ID-based references) raise the data model semantically to a directed acyclic graph or infinite tree, which requires appropriate handling. Syntactic validation of documents with respect to a hardened schema has been shown to be sufficient as a countermeasure against signature wrapping attacks. Stream validation can furthermore check the syntactic correctness of a document with respect to a schema efficiently. However, computing a unified schema without extension points from a set of loosely composed protocol schemas is hard [204].

This thesis proposes a learning-based approach in a security monitor, where a language representation of acceptable documents is inferred from examples. A validator component in the security monitor can then syntactically validate incoming XML messages with respect to the inferred representation, and the approach is therefore called language-based anomaly detection. Despite its drawbacks, anomaly detection still has the hypothetical capability of identifying zero-day attacks, and the security monitor is designed according to the following principles:

• Message-level monitoring. In the LangSec threat model, the interpretation of a message could lead to violating behavior, and syntactic correctness reduces the attack surface. Client-cloud interaction involves various languages and protocols, and an observation point needs to provide a complete and unadulterated view on messages. However, the trend toward pervasive encryption and multihoming protocols restricts network monitoring, e.g., DPI technology, and a monitor therefore needs to be placed on the client or service side or on the middleware layer.

• Respecting the language class. When the language classes of inputs and the reference model do not match, there is an expressiveness gap, and errors (i.e., false negatives and false alarms) are the consequence. In practice, an attacker wants to exploit such a gap to evade detection, e.g., polymorphic shellcodes and mimicry attacks. For minimizing errors, the reference model needs to be as expressive as the input language or a good-enough approximation.

• Applicability considerations. When an anomaly is identified, it should be clear to a human operator on what grounds the decision was made and where the anomalous part is located in the document.


Furthermore, to withstand a poisoning attack during training, well-known attacks should be filtered from the training data beforehand. The reference model should also provide means of unlearning a discovered poisoning attack and of sanitization for removing potentially hidden attacks.

4.1 Problem Description

Several problems arise in language-based anomaly detection, in particular: necessary assumptions about the attacker and system under observation, stream processing, a tractable learning scenario and algorithm, and applicability of the monitoring solution.

4.1.1 Architecture and Assumptions

As illustrated in Figure 4.1, the proposed security monitor is interface centric for a particular client or service interface and has three components: a misuse detection component filters well-known attacks, a learner component infers a reference model for the validator component, and incoming messages for a certain interface are then validated by deciding MEMBERSHIP with respect to the inferred reference model. The system under observation could be a client or a service, and it should be noted that integrating the monitor directly as an interface component in the system under observation is possible.


Figure 4.1: Proposed language-based anomaly detection architecture

Assumption 1 (Attacker capabilities). The assumed attacker is capable of reading and modifying XML-based messages in transit and of sending a violating message directly to the system under observation, i.e., a client or service. Various XML parsing and semantic attacks, as discussed in Section 3.3, can therefore be conducted by the attacker.

When a message is sent to the system under observation, the security monitor performs a two-stage filtering approach before the message is forwarded. First, the misuse detection component maintains a knowledge base to identify well-known attacks. The component is not further specified because it is not in the focus of this thesis. Various misuse detection techniques already exist in misuse-based intrusion detection [233], e.g., matching for attack signatures or checking XML attachments for malware or viruses. For XML, techniques can also include security policy verification as proposed by Bhargavan et al. [53]. Second, if the message gets clearance from the misuse detection component, the validation component checks whether the message is syntactically acceptable with respect to the reference model. The learner component infers the reference model in an independent process from positive example documents which are considered acceptable for the particular interface.
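The two-stage filtering just described can be summarized as a small dispatch routine; the sketch below is purely illustrative, and the misuse checker, membership test, and response hooks are assumed callables rather than components specified in this thesis.

```python
# Sketch of the two-stage filtering in the security monitor (illustrative).
# `known_attack` stands in for the misuse detection component, `is_member`
# for deciding MEMBERSHIP with respect to the learned reference model.

def monitor(message, known_attack, is_member, forward, respond):
    # Stage 1: misuse detection against well-known attacks.
    if known_attack(message):
        respond(message, reason="known attack")      # validator is skipped
        return
    # Stage 2: syntactic validation against the inferred reference model.
    if not is_member(message):
        respond(message, reason="anomalous message")
        return
    forward(message)                                 # attack free and normal
```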

The goals of this thesis are to specify a reference model that enables efficient decision making in the validator component and to define a learner that generates a reference model from examples. Finally, the message is forwarded to the system under observation if it is attack free and normal with respect to the reference model. In case of a well-known attack or an anomalous message, a response would be required, e.g., filtering the message and notifying the system under observation. However, responses are not further specified. Identification of a well-known attack is sufficient evidence for skipping the validator component and responding immediately.

Under the LangSec threat model, a receiving system interprets a message and eventually exhibits violating behavior. In particular for XML-based messages, parsing and semantic attacks affect different parts in the receiving system. While parsing attacks could be identified from syntactic restrictions on messages, semantic attacks are hard or impossible to identify because the security violation manifests in the receiving system's interpretation. Especially in cloud computing, where the client and the service are often considered black boxes, message semantics have to be assumed unknown to the security monitor. Business logic in XML usually binds message semantics to element types and datatypes of text content. Suppose a document has a tree-structured data model without references. The business logic interprets every subtree according to the type of the subtree's root node, which is either assigned explicitly from a schema (e.g., pre-order typed) or implicitly assumed in the implemented code. Also, text content is interpreted according to the schema-declared datatype or the implicit datatype assumed in the implemented code. The reference model learned by the learner component should therefore capture the accepted language by learning the accepted types and datatypes from messages and enable efficient decision making in the validator component. Checking types and datatypes in an XML message in the validator component (i.e., schema validation) is then an anomaly detection method if the following assumption holds:

Assumption 2 (Type-consistent behavior). For all manifestations of a certain accepted type or datatype in an XML message, the system under observation has accepted behavior.

In other words, when the type of an anomalous or attack-carrying message part is syntactically indistinguishable from the type of an acceptable message part, the attack cannot be identified in the proposed language-based approach.

4.1.2 XML Stream Validation

The security monitor must be capable of stream processing. While SOAP and RESTful web services exchange XML in document form, XMPP specifies a communication protocol based on an open-ended XML stream, where a session starts with the first open tag, ends with the last close tag, and well-matched XML stanzas are exchanged in between. An XML document is therefore considered as a stream of startElement, endElement, and characters events in the spirit of StAX processing, and the security monitor has to decide MEMBERSHIP in a single pass. XVPAs, as defined by Kumar et al. [223] and recalled in Section 3.4, are a suitable language representation for stream validation. However, the XVPA definition does not include validation of text content, and an extension is required.


4.1.3 Grammatical Inference

Learning a language model for anomaly detection is related to grammatical inference [178]: to algorithmically learn a language representation (e.g., a grammar or automaton) from a certain language presentation (i.e., a learning setting). Machine learning and grammar induction algorithms typically compute a model that explains training data with a minimal generalization error. Grammatical inference furthermore assumes the existence of a hidden target that needs to be discovered and emphasizes the quality of the process in identifying the hidden target [178, pp. ix–x]. Intuitively, a grammatical inference process needs to converge to the hidden target, and two relevant issues emerge: the quality of the learned result and the efficiency of the learning process in terms of polynomial time and space [177].

Assumption 3 (Interface language). The XML accepted by an interface is not random but has structure, so a receiving system can algorithmically parse and interpret a message. The structure is either explicitly specified in a schema or implicitly assumed by the implemented code, e.g., an ad hoc format.

Schema inference is related to grammatical inference, and some of the works discussed in Section 3.5 indicate a grammatical inference approach, e.g., Bex et al. [49, 48, 52, 51] and Fernau [124, 125]. The proposed security monitor observes the language of messages received on a certain interface, and an obvious learning setting is to infer an XVPA from a set of acceptable XML documents. Additionally, for anomaly-based intrusion detection, learning from positive examples is vulnerable to poisoning attacks: if an attacker is able to place a single attack in the training data, the resulting model would be flawed. This problem needs to be considered in the proposed security monitor.

Generative and Discriminative Modeling. The XVPA model is generative, which means that strings from the language can be generated. In machine learning, discriminative models, including the kernel methods discussed in Subsection 3.1.6, are argued to have better efficacy than generative models. In particular, Liang and Jordan [238] show that discriminative models produce lower approximation and asymptotic estimation errors when the model is misspecified, e.g., from incomplete training data or a poisoning attack. Nonetheless, XVPAs are capable of real-time stream processing by design, and they have applicability advantages.

4.1.4 Applicability

As discussed in Section 3.1, anomaly-based intrusion detection techniques are exposed to problems that affect their applicability, in particular: representativeness of training data, concept drift in the underlying system, the sensitivity to false alarms, and explanation of detected anomalies. These problems need to be addressed accordingly. First, the security monitor is interface centric by design, and observed messages at an interface are representative for the particular system under observation. Unfortunately, there is no guarantee whether a set of training examples is complete for a certain interface, and in practice, learning progress can only be estimated, e.g., by the number of mind changes in the reference model during learning [178]. The exposure to concept drift is minimized in an interface-centric architecture because interface changes also need to be propagated to peers, especially for statically-typed interfaces.

For dynamically-typed interfaces, systematic aging of knowledge and continuous learning [228] or verification by learning time-delayed reference models and comparing their concurrent operation [355] are promising strategies. Second, a grammatical inference approach minimizes the expressiveness gap between the interface language and the reference model for anomaly detection, and errors such as false positives and negatives will therefore be minimal when the learner has converged. It should be noted that grammatical inference needs certain assumptions about the language class in reality, and an experimental evaluation is therefore required. Finally, XVPAs are intuitive to understand for a human operator with basic knowledge about finite-state machines. By validating an XML event stream in real time, the stream can be stopped exactly when an anomalous event occurs and before it propagates to the receiving system. The last active state and the position of the anomaly within the stream are directly visible to a human operator or another component for further reasoning.

4.2 Methodology

XML attacks can affect element structure and text contents in documents; however, XVPAs, as defined by Kumar et al. [223], do not consider text content yet. Text contents are strings over the Unicode alphabet, and they could be from an arbitrary language unknown to the learner, e.g., natural language; an unknown language class, however, affects grammatical inference. To characterize text content, the schema languages XSD and Relax NG resort to datatypes, and a first step is a datatype extension for XVPAs to also validate text content. During learning, text contents must be mapped to datatypes, and in a second step, a datatype system for the generalization of text is proposed. For grammatical inference in a third step, the considered setting is learning from positive examples, which is similar to unsupervised machine learning. For applicability in the designated security monitor, certain refinements for unlearning and sanitization are then discussed in a fourth step. Finally, the proposed learning approach involves several generalizations and therefore needs to be experimentally evaluated.

4.2.1 Datatyped Language Representation

XSD already provides a datatype system for specifying text content, which is also supported in Relax NG. For completeness, the XSD datatype hierarchy [463] is illustrated in Figure A.1 in Appendix A. XSD distinguishes complex types for document structure (i.e., types M in an EDTD) and simple types for element and attribute content. For every primitive and built-in XSD datatype in the hierarchy, the specification defines an unambiguous value space and a lexical space over Unicode strings. Figure 4.2 illustrates this relationship. Lexical spaces in the XSD standard are specified by regular expressions. The differentiation between value and lexical space is necessary because several strings over Unicode can represent the same value, e.g., consider the strings "00" and "0"; both represent the zero value in the integer datatype.

In their original definition, XVPAs do not model text content or datatypes, and as a foundation for validation and learning algorithms in the proposed security monitor, datatypes for text content are introduced by XVPA module-internal transitions. Two extended classes of automata are therefore proposed: datatyped XVPAs (dXVPAs), where internal transitions range over datatypes, and character-data XVPAs (cXVPAs) for linear-time validation of documents, where internal transitions are predicates over Unicode character data.



Figure 4.2: Value and lexical spaces of XSD datatypes

4.2.2 A Lexical Datatype System for Datatype Inference

Text content can be from an arbitrary unknown language class, and a learner can only observe the ambiguous lexical space, e.g., the string "0" is lexically valid for the boolean, int, and token XSD datatypes. Related work in schema inference has dealt with this issue by inferring datatypes for text content using lexical subsumption. Chidlovskii [88] focuses on numeric datatypes and proposes a datatype system based on XSD datatypes for inferring the minimally required datatype for a set of strings. Datatypes are ordered by subsumption of their respective lexical spaces, and Hegewald et al. [176] take a similar approach by restricting the datatype system even further to seven primitive datatypes. This thesis approaches the problem of text content handling with an improved lexical datatype system that considers all lexically distinguishable datatypes. The datatype system is extensible, and to deal with lexical ambiguity, a string is not represented by a single datatype but by a set of datatypes.

The set of required datatypes for some content is computed by identifying the minimally required datatypes with respect to the subsumption order on the lexical spaces of datatypes, similar to the approach by Chidlovskii [88]. In a second step, a preference heuristic removes overly general datatypes with respect to semantic specificity and derivation in the XSD type hierarchy order, and only the most specific minimally required datatypes remain.
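The following sketch illustrates the lexical-subsumption idea under strong simplifications: only five datatypes, approximate lexical spaces written as regular expressions, and a hand-coded subsumption order; the datatype system actually used in this thesis (Section 5.3) is richer and includes the preference heuristic over the full XSD hierarchy.

```python
import re

# Sketch: lexical datatype inference by subsumption (illustrative only).
# Lexical spaces are approximated by regular expressions; the real XSD
# lexical spaces and the preference heuristic are more involved.
LEXICAL = {
    "boolean": r"true|false|1|0",
    "int":     r"[+-]?[0-9]+",
    "decimal": r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)",
    "gYear":   r"-?[0-9]{4}",
    "string":  r".*",
}
# SUBSUMED_BY[a] = all b whose lexical space contains the one of a
SUBSUMED_BY = {
    "boolean": {"boolean", "string"},
    "int":     {"int", "decimal", "string"},
    "decimal": {"decimal", "string"},
    "gYear":   {"gYear", "int", "decimal", "string"},
    "string":  {"string"},
}

def candidate_types(text):
    """All datatypes whose lexical space contains the string."""
    return {t for t, rx in LEXICAL.items() if re.fullmatch(rx, text)}

def minimal_types(texts):
    """Minimally required datatypes for a set of observed text contents."""
    common = set(LEXICAL)
    for text in texts:
        common &= candidate_types(text)
    # drop datatypes that strictly subsume another remaining datatype
    return {t for t in common
            if not any(s != t and t in SUBSUMED_BY[s] for s in common)}

print(minimal_types(["0", "1"]))        # {'boolean', 'int'}: lexically ambiguous
print(minimal_types(["1999", "2016"]))  # {'gYear'}
print(minimal_types(["2016", "-7"]))    # {'int'}
```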

4.2.3 Learning from Positive Examples

Figure 4.3 illustrates the intuitive learning setting. The learner component receives example documents which should be accepted by the interface, and the learner computes an XVPA for the validation component. This learning setting corresponds to Gold's identification in the limit from positive examples [155], also referred to as learning from text [178], and the following definition is according to Fernau [125].

[Figure 4.3 depicts the learning setting: example documents and the XML documents observed at the interface feed the learner, which outputs an automaton; the validator uses this automaton to decide whether the syntax of a document is normal.]

Figure 4.3: Learning from positive examples

Definition 8 (Identification in the limit from positive examples [125]). Let $\mathcal{L}$ be a target language class (e.g., the regular languages) that can be characterized by a class of language-describing devices $\mathcal{D}$ (e.g., a class of automata). Furthermore, $E : \mathbb{N} \rightarrow L$ is an enumeration of strings (i.e., examples) for a language $L \in \mathcal{L}$, and the examples may be in arbitrary order with possible repetitions. Target class $\mathcal{L}$ is identifiable in the limit if there exists a so-called inductive inference machine or learner $I$ with the following behavior:

• Learner I receives a sequence of examples E(1),E(2),... as input.

• Learner I reacts by computing a stream of hypotheses (e.g., automata) $D_1, D_2, \dots$ such that every $D_i \in \mathcal{D}$.

• For every enumeration of $L \in \mathcal{L}$, there is a convergence point $N(E)$ such that $L = L(D_{N(E)})$ and $j \geq N(E) \Rightarrow D_j = D_{N(E)}$.

Furthermore, a set of positive examples $S^+ \subseteq L$ is said to be characteristic or representative for $L$ if learner $I$ reaches convergence when learning the examples from $S^+$ in arbitrary order [125]. Gold-style learning is a learning process that stabilizes in the limit, and until then, the learner can change its mind in terms of the current hypothesis. In practice, the convergence point can only be estimated by measuring, e.g., the number of mind changes in the learning process.

Unfortunately, this learning setting is hard: a language class that contains all finite languages and at least one infinite language cannot be identified from positive examples only [155]. For example, the regular language class cannot be identified from positive examples. However, there are several restricted subclasses of practical relevance where identification from positive examples is possible, e.g., k-reversible [15] and k-testable [144] regular languages. Fernau [125] introduces the notion of function-distinguishable languages as a generalization of regular language families identifiable from positive examples. The event stream representation of XML is a VPL and even more powerful than regular languages, and to allow inference of XVPAs, a learnable subclass of XML, which closely approximates acceptable inputs and covers most practical cases, needs to be identified.

Proposed Learning Algorithm. The security monitor considers dXVPAs as the learning target. Figure 4.4 illustrates the learning process, and learning from positive examples is approached as follows. A first step is to characterize states and transitions that can be gathered from examples and to assign named states to prefixes of document event streams. This leads to the construction of a so-called Visibly Pushdown Prefix Acceptor (VPPA), i.e., a VPA that accepts all the documents learned so far. As a second step, state merging generalizes the VPPA; the parameterized distinguishing function $f_{k,l}$ partitions the named state space into mergeable equivalence classes based on local typing restrictions similar to the SOXSDs in Bex et al. [51].


[Figure 4.4 shows the learner and validator side by side: each example E(i) updates the learner's intermediate VPAi, from which dXVPAi and the validator's cXVPAi are derived.]

Figure 4.4: The proposed learning process

In Figure 4.4, the intermediate automaton after state merging is denoted as VPAi, and the learner component stores this representation during the learning process. The learner is:

• incremental by updating the stored VPAi with a new example

• set-driven, i.e., the resulting VPAi is the same for any permutation of a set of examples

• consistent with all examples seen so far at any point of the learning process

• conservative, i.e., VPAi stays unchanged if there is no new knowledge to add

• strong-monotonic, i.e., i < j ⇐⇒ L(VPAi) ⊆ L(VPAj)

The output of the learner is a dXVPA, and computing the output involves partitioning the states of VPAj into modules, adding return transitions to satisfy the single-exit property of XVPAs, and minimizing the dXVPA by merging equivalent modules. For convenience, when the learner learns a set of j examples, it is sufficient to compute the dXVPA output from the last VPAj. For later real-time validation in the security monitor, a dXVPA is finally converted into a cXVPA by replacing internal transitions over datatypes with predicates over Unicode text contents.
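As an illustration of this last conversion step, the sketch below collapses the datatypes of a dXVPA internal transition into a single compiled predicate over text content; the simplified lexical spaces and all names are assumptions of this sketch, not the thesis implementation.

```python
import re

# Sketch: turning dXVPA datatype transitions into a cXVPA predicate
# (illustrative; lexical spaces are simplified regular expressions).
LEXICAL = {"int": r"[+-]?[0-9]+", "gYear": r"-?[0-9]{4}", "string": r".*"}

def to_predicate(datatypes):
    """Union of the lexical spaces of the given datatypes, compiled once."""
    union = re.compile("|".join("(?:%s)" % LEXICAL[t] for t in datatypes))
    return lambda text: union.fullmatch(text) is not None

# A dXVPA internal transition labeled with {int, gYear} becomes one cXVPA
# internal transition labeled with this predicate:
accepts_text = to_predicate({"int", "gYear"})
print(accepts_text("2016"))        # True
print(accepts_text("not a year"))  # False
```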

4.2.4 Anomaly Detection Refinements

The proposed learning process needs further refinements to be applicable; in particular, the model should support unlearning, so a once-learned example can be safely removed from the reference model without retraining from scratch, and sanitization to remove hidden poisoning attacks from training data. Training data is assumed to be already filtered by a misuse detection component to eliminate well-known attacks. For sanitization, Cretu et al. [101] present a local sanitization process for removing potential attacks from training data by learning disjoint micro-models from partitioned training data and a distributed collaborative sanitization strategy for defending against attacks that persist in local training data. The local approach is based on the assumption that poisoning attacks and anomalies in training data are rare, and this assumption is also adopted in this thesis.

Sanitization is proposed as follows. The intermediate VPAi in the learner is refined by introducing weights for states and transitions, similar to the weighted PTA in Figure 3.12. A weight reflects how often a state or transition has been encountered during learning, and when states are merged in the generalization step, the weights are summed up. For computing dXVPAi, the sanitization process trims states and transitions whose weights are below a given threshold.


Low-frequent information in the training data can therefore be removed. Furthermore, the weight extension allows selective unlearning of a once-learned example by decrementing the weights and deleting states and transitions with zero weight. Unlearning can be exploited for handling concept drift in dynamically-typed interfaces, e.g., by unlearning once-learned examples after a certain expiration time, so the model systematically ages, and continuous learning keeps the model up to date [228, 229]. It should be noted that sanitization affects the learner's properties: more examples are needed to reach a stable dXVPA output, the learner is not conservative anymore because learning the same example twice still increases the weights, and the learner also loses its consistency and monotonicity properties.
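A minimal sketch of the weight bookkeeping follows; it tracks frequencies on transitions only and uses a plain counter, whereas the thesis attaches weights to the states and transitions of the intermediate VPA — the class and method names here are illustrative.

```python
from collections import Counter

# Sketch: frequency weights on learned transitions with support for
# unlearning and sanitization (illustrative; in the thesis the weights
# live on the states and transitions of the intermediate VPA).

class WeightedModel:
    def __init__(self):
        self.weights = Counter()          # transition -> observation count

    def learn(self, transitions):
        self.weights.update(transitions)  # summing up corresponds to merging

    def unlearn(self, transitions):
        self.weights.subtract(transitions)
        self.weights = +self.weights      # drop entries with weight <= 0

    def sanitize(self, threshold):
        """Trim low-frequent transitions before emitting the dXVPA."""
        return {t for t, w in self.weights.items() if w >= threshold}

model = WeightedModel()
for example in [["a", "b"], ["a", "b"], ["a", "x"]]:   # "x" is rare
    model.learn(zip(example, example[1:]))
print(model.sanitize(threshold=2))   # {('a', 'b')} -- the rare 'x' is trimmed
model.unlearn([("a", "x")])          # remove the once-learned example
print(model.weights)                 # Counter({('a', 'b'): 2})
```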

4.2.5 Experimental Evaluation

The proposed security monitor involves two generalization steps: datatype inference for text contents and grammatical inference for document structure. An experimental evaluation is therefore necessary. A first step is to identify suitable measures for evaluating the proposed learning approaches, e.g., detection rate, false alarms, or learning progress in terms of mind changes. The second step is to implement various scenarios for evaluation. In particular, two kinds of scenarios are looked into: synthetic and real scenarios. The XML data generator ToXgene [35] allows building collections of documents that satisfy an XSD, and XML attacks are then added manually. For the real scenarios, a real web service is developed according to state-of-the-art guidelines, so actual attacks can be simulated and recorded in datasets. In particular, the WS-Attacker [253] tool enables web service penetration testing, including the signature wrapping attack.

4.3 Use Cases

Middleware Security Component. One particular research subject in the Christian Doppler Laboratory for Client-Centric Cloud Computing is the Client-Cloud Interaction Middleware (CCIM) [74, 75, 76]. Figure 4.5 illustrates the CCIM architecture and its components. The middleware is specified in the ambient ASM [74] formalism for modeling distributed systems of communicating and executing agents, including mobile components, and this architectural model integrates novel software solutions for plot-based access control, client-to-client interaction, identity and access management, content adaptivity, SLA management, docker service support, and the proposed approach for security monitoring. Plots allow end-users to combine functions of cloud services offered by various cloud providers and to access the functions using a wide range of devices. Furthermore, plots enable end-users to define sophisticated access control policies on the client side for service operations. The proposed language-based anomaly detection approach is designated as a security monitoring component, integrated in the middleware, for providing a language-based security barrier between clients and services. For more details on the CCIM, the reader is referred to Bósa et al. [76].



Figure 4.5: CCIM architecture and components

XML Firewall. An XML firewall is a physical or virtual network appliance for monitoring incoming and outgoing XML traffic of SOAP/WS-* and RESTful services. The purpose is to implement a security policy for XML-based messaging, e.g., access control, filtering, message routing, message transformation and canonicalization, schema validation, WS-Security verification, and pattern matching of attachments for known attacks, to name a few. However, the problems caused by schema extension points in existing standards also affect schema validation in XML firewalls. The proposed learner could be integrated as an anomaly detection component in an XML firewall by inferring a language representation for a particular client or service interface.

Client Browser Plug-In. While HTML is not necessarily a well-matched language, websites are also often served in an XML syntax, i.e., XHTML and the XML serialization of HTML5. Furthermore, XML has been originally proposed for dynamic updates of a browser DOM in the AJAX paradigm. A fundamental security threat to web browsers is cross-site scripting: the execution of violating JavaScript code and the consequences thereof, e.g., privilege escalation, shellcode execution, cross-site request forgery, illegal information flows, and drive-by downloads. There are various forms of cross-site scripting [206]; however, they all have in common that an attacker manages to smuggle violating JavaScript into the victim's runtime environment, nested in a script tag, so the browser interprets the violating code during runtime. Inferring language representations for an XML-based website and its dynamic resources could be a solution for detecting violating script tags before they are interpreted by the browser. This could be achieved by a browser plug-in on the client side which learns individual reference models for XML resources on the web.

Chapter 5

Grammatical Inference of XML

This chapter presents the main theoretical results. Section 5.1 describes the representation of documents as symbolic event streams according to StAX processing. Section 5.2 focuses on text content and datatypes in mixed-content XML and introduces necessary models and their relationships, i.e., datatyped EDTDs (dEDTDs) as an extension of EDTDs, dXVPAs as a datatyped extension of XVPAs, and cXVPAs for linear-time event stream validation. In Section 5.3, an extensible lexical datatype system for datatype inference from text contents in mixed-content XML is presented. The algorithms and intermediate representations for learning from positive XML examples are then introduced in Section 5.4. Section 5.5 states the properties of the proposed learner and discusses computational complexities of involved algorithms, and Section 5.6 closes the chapter by describing necessary refinements for unlearning and sanitization in a language-based anomaly detection setting. Several restrictions and design choices are made in this chapter. Therefore, Section 7.3 in the conclusion chapter sums them up and argues about their purpose.

5.1 Document Event Stream Representation

The SAX and StAX processing models for XML (Subsection 3.2.2) consider a document as a sequence of events for open- and close-tags, character data of text content, processing instructions, comments, and entity references. The difference between the processing models is in their implementation: while SAX requires callback handlers, StAX provides an iterator over events. Attributes are usually considered as part of open-tags, and the ordering of open-tags in an event stream coincides with depth-first traversal of the syntactical tree. For a concise representation based on StAX events that preserves the syntactic tree structure, a document event stream is defined as follows.

Definition 9 (Document event stream). A document event stream is a sequence of events $e_1, \dots, e_n$, and every event carries a value $lab(e_i)$. There are three types of events: startElement, endElement, and characters. Processing instructions, comments, and entity references are ignored for the sake of simplicity. Attributes in an open-tag are alphabetically sorted, treated as special elements, and each attribute is translated into a subsequence of startElement, characters, and endElement events. Reserved XML attributes for namespace declarations are removed. The values of startElement and endElement events are qualified element names, and the value of a characters event is the text content as a Unicode string.


To simplify notation, a startElement event is denoted by its qualified name $m$, and $\overline{m}$ is the corresponding endElement event. Text between square brackets denotes a characters event and its text content. Nested CDATA sections are automatically unwrapped by the parser, and Figure 5.1 illustrates an example.

[Figure 5.1 shows a small XML fragment with elements a and b and the text contents 10.0 and some text content, together with the symbolic event stream obtained by StAX processing.]

Figure 5.1: Symbolic StAX event stream
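To make Definition 9 concrete, the following sketch prints a symbolic event stream with Python's SAX API; it is a simplification in that whitespace-only text is dropped, attributes are not expanded into sub-events, namespace declarations are not stripped, and close tags are written with a leading slash instead of an overline.

```python
import xml.sax

# Sketch: emit a symbolic document event stream in the spirit of
# Definition 9 (simplified: whitespace-only text is dropped, attribute
# and namespace handling are omitted).
class EventPrinter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(name)                     # open-tag event

    def endElement(self, name):
        self.events.append("/" + name)               # close-tag event

    def characters(self, content):
        if content.strip():
            self.events.append("[" + content + "]")  # characters event

handler = EventPrinter()
xml.sax.parseString(b"<a><a>10.0</a><b>some text content</b></a>", handler)
print(" ".join(handler.events))
# a a [10.0] /a b [some text content] /b /a
```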

5.2 Models for Datatypes and Text Content

Text contents in XML are strings over the Unicode alphabet U and are located between two tags or in an attribute value. In documents conforming to DOCTYPE declarations or XSD specifications, an element either contains nested elements or text content. Mixed-content XML allows interleaved text contents, e.g., for XML as a markup language like XHTML [434] or for document-style web services [8]. An example of mixed content is given in Figure 3.4 on page 54. Text content can be restricted by datatypes in XSD and Relax NG, and XML language representations that capture datatypes are therefore needed. For this work, only the lexical spaces of XSD datatypes [463] are considered, in a generalized notation of lexical datatypes:

Definition 10 (Lexical datatypes). Let $T$ be a set of datatypes. A regular language over the Unicode alphabet describes a lexical space in XSD, and the function $\varphi : T \rightarrow REG(U)$ assigns lexical spaces to datatypes.

Lexical datatypes make it possible to define datatyped event streams, where a datatype replaces a Unicode string as the value of a characters event.

Definition 11 (Datatyped event stream). A datatyped event stream is a sequence of startElement, endElement, and characters events. Every event $e$ carries a value $lab(e)$, i.e., qualified element names for startElement and endElement events and datatypes $\tau \in T$ for characters events. A document event stream $w$ corresponds to a datatyped event stream $w'$ if $w$ and $w'$ have congruent event types, the qualified element names in the respective startElement and endElement events are equal, and the text in a characters event in $w$ is in the lexical space of the congruent characters event in $w'$.

5.2.1 Datatyped EDTD

The first step toward mixed-content support is to introduce datatypes in the schema language EDTD (Definition 2 on page 56).

Definition 12 (Datatyped EDTD (dEDTD)). A tuple $D = (\Sigma, M, T, \varphi, d, m_0, \mu)$ is a dEDTD, where $(\Sigma, M, d, m_0, \mu)$ is an EDTD extended by a set of datatypes $T$, and the function $\varphi$ assigns lexical spaces over Unicode. Production rules $d : M \rightarrow REG(M \uplus T)$ are regular languages over both types and datatypes, represented as regular expressions.


Mixed content is not allowed to interfere with the typing of elements. For example, DTD disallows any order of elements in a content model when mixed content is activated, and XSD content models cannot restrict the datatypes of interleaved strings in mixed content. To reflect the behavior of datatype noninterference, production rules in dEDTDs are syntactically restricted:

Definition 13 (Mixed content restrictions). Right-hand sides of production rules in a dEDTD have the following restrictions with respect to datatypes:

• Datatype choice. An allowed datatype term $E$ in a production rule is either a single datatype $\tau$ or a choice term $(\tau_1 + \tau_2 + \dots + \tau_j)$.

• Datatype sequence. Production rules must obey $d(m) \subseteq (M \cup L(E)M)^* \, L(E)?$, where $L(E)$ is the regular set of all datatype choices.

This definition supports mixed-content XML, where text is allowed between a start- and an end-tag, an end- and a start-tag, two start-tags, and two end-tags. The datatype sequence rule prevents sequences of datatype terms because they can never appear in the event stream representation, i.e., character data in a document always leads to a single characters event in an event stream, and two characters events are always separated by one or more element events. A source of nondeterminism in later steps is eliminated by these restrictions.

The accepted language with respect to datatyped event streams is defined according to the specialized languages by Kumar et al. [223]. Every type $m$ characterizes its typed language $L_d^{type}(m) \subseteq (M \uplus T \uplus \overline{M})^*$ as the smallest set

$L_d^{type}(m) \supseteq \{ m \cdot v \cdot \overline{m} \mid v \in R_d(w) \text{ and } w \in d(m) \}$, with $R_d(\varepsilon) = \{\varepsilon\}$ and

$R_d(aw) = \{a\} \cdot R_d(w)$ if $a \in T$, and $R_d(aw) = L_d^{type}(a) \cdot R_d(w)$ if $a \in M$.

$R_d$ is a recursive helper function to resolve nested languages from production rules. Every type $m$ accepts the datatyped event streams $L_d(m) = \{ \mu(w) \mid w \in L_d^{type}(m) \}$. Mapping $\mu$ is extended over strings $(M \uplus T \uplus \overline{M})^*$, where $\mu(\tau) = \tau$ if $\tau \in T$ and $\mu(\overline{m}) = \overline{\mu(m)}$ if $m \in M$. A dEDTD $D$ therefore accepts the datatyped event streams $L(D) = L_d(m_0)$.

Example 4. The exemplary dEDTD $D = (\Sigma, M, T, \varphi, d, m_0, \mu)$ is defined by the elements $\Sigma = \{$dealer, newcars, usedcars, ad, model, year$\}$, the types $M = \{$dealer, newcars, usedcars, ad_new, ad_used, model, year$\}$, $T$ and $\varphi$ according to the XSD datatypes, the start type $m_0 =$ dealer, the mapping $\mu$ that assigns the equivalent element name to every type, and the production rules d:

dealer ↦ newcars · usedcars
newcars ↦ ad_new*
usedcars ↦ ad_used*
ad_new ↦ model
ad_used ↦ model · year
model ↦ string
year ↦ gYear + gYearMonth

Both types ad_new and ad_used map to the same element ad. In XSD jargon, the types model and year are simple types, and the others are complex types.
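As a data-structure illustration, Example 4 can be written down directly; the dict encoding and the regular-expression strings below are ad hoc choices for this sketch, not the representation used later in the learner.

```python
# Sketch: the dEDTD of Example 4 as plain Python data (illustrative).
# Production rules are kept as regular expressions over types and datatypes.
SIGMA = {"dealer", "newcars", "usedcars", "ad", "model", "year"}
TYPES = {"dealer", "newcars", "usedcars", "ad_new", "ad_used", "model", "year"}
MU = {  # surjective mapping from types to element names
    "dealer": "dealer", "newcars": "newcars", "usedcars": "usedcars",
    "ad_new": "ad", "ad_used": "ad", "model": "model", "year": "year",
}
START = "dealer"
RULES = {  # d : types -> regular expressions over (types and datatypes)
    "dealer":   "newcars usedcars",
    "newcars":  "ad_new*",
    "usedcars": "ad_used*",
    "ad_new":   "model",
    "ad_used":  "model year",
    "model":    "string",
    "year":     "gYear + gYearMonth",   # datatype choice
}

assert {MU[t] for t in TYPES} == SIGMA  # µ is surjective onto the element names
```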


5.2.2 Datatyped XVPA

The dXVPA notation is an extension of the XVPA definition from Kumar et al. [223] (Definition 6 on page 70) by introducing module-internal transitions for datatypes.

Definition 14 (Datatyped XVPA (dXVPA)). A dXVPA A over (Σ, M, µ, T, φ) is a tuple A = ({Q_m, e_m, X_m, δ_m}_{m∈M}, m_0, F) such that A is an XVPA that satisfies the single-exit property. Module transitions are δ_m = δ_m^call ⊎ δ_m^int ⊎ δ_m^ret, where call and return transitions capture values of startElement and endElement events, and internal transitions δ_m^int ⊆ { q_m −τ→ q_m′ | q_m, q_m′ ∈ Q_m and τ ∈ T } capture datatypes of characters events. A dXVPA needs mixed-content restrictions on datatype transitions, like dEDTDs, to guarantee that datatype transitions do not affect typing of elements:

• Datatype choice. A datatype choice at state q_m leads to a single next state, i.e., if q_m −τ→ q_m′ ∈ δ_m^int and q_m −τ′→ q_m″ ∈ δ_m^int then q_m′ = q_m″.

• Datatype sequence. A return transition can only point to a state that is not a successor of a datatype choice, i.e., if ∃q. q −n/p_m→ q_m ∈ δ_n^ret then ∀q′. q′ −τ→ q_m ∉ δ_m^int.

Both restrictions have the same effect as in dEDTDs, and they guarantee that after a characters event, a module is either exited or another module is called. A dXVPA is a modular VPA, and the only difference is that internal transitions are utilized for datatypes. The semantics and accepted language L(A) = L(m_0) are characterized by the corresponding VPA over (Σ ⊎ T ⊎ Σ̄).

Theorem 1. Every dEDTD has a language-equivalent dXVPA, and every dXVPA has a language-equivalent dEDTD.

Proof. This constructive proof is a straightforward extension of the proof from Kumar et al. [223] by introducing internal transitions. First, the dEDTD-to-dXVPA translation is shown. Let D = (Σ, M, T, φ, d, m_0, µ) be a dEDTD. For every type m, the regular language of d(m) is represented by a state-minimal DFA_m = (Q_m, M ⊎ T, e_m, δ_m, X_m), where Q_m is a set of states, M ⊎ T is the alphabet, e_m ∈ Q_m is the distinguished start state, δ_m : Q_m × (M ⊎ T) → Q_m is a transition function, and X_m ⊆ Q_m are the final states. The DFAs are without dead states, i.e., for every state there is a string over M ⊎ T that leads to a final state, and δ_m is considered partial. Then A = ({Q_m′, e_m′, δ_m′, X_m′}_{m∈M}, m_0, F) is the corresponding dXVPA over (Σ, M, µ, T, φ) such that for every module m ∈ M:

• Q_m′ = Q_m, e_m′ = e_m, X_m′ = X_m

• δ_m′ = δ_m^call ⊎ δ_m^ret ⊎ δ_m^int, where

  – δ_m^call = { p_m −µ(n)/p_m→ e_n | n ∈ M and δ_m(p_m, n) = q_m }

  – δ_m^ret = { q −µ(m)/p_n→ q_n | n ∈ M and δ_n(p_n, m) = q_n and q ∈ X_m }

  – δ_m^int = { q_m −τ→ p_m | τ ∈ T and δ_m(q_m, τ) = p_m }

Intuitively, a DFA transition p_m −n→ q_m creates a call transition from state p_m in module m to the entry of n and saves the state on the stack, and for all exit states in module n, return transitions to q_m in module m are added. Since δ_n is deterministic, δ_m is

deterministic for all modules m, n ∈ M. The construction of δ_m^ret guarantees that for all final states q ∈ X_m the return behavior is the same, and X_m′ satisfies the single-exit property. A is therefore a dXVPA. Let w ∈ (M ⊎ T ⊎ M̄)∗ be a typed event stream. By induction on the length of w it can be shown that m · w · m̄ ∈ L^type_d(m) ⟺ (e_m, ⊥) −µ(w)→_A (q, ⊥) and q ∈ X_m. Since F = X_{m_0}, the accepted language is L(A) = L(m_0) = L(D).

For the opposite direction, A = ({Q_m, e_m, δ_m, X_m}_{m∈M}, m_0, F) is a given dXVPA over (Σ, M, µ, T, φ). For every module m, a production rule d(m) = L(D_m) in dEDTD D = (Σ, M, T, φ, d, m_0, µ) is then defined in terms of DFA D_m = (Q_m′, M ⊎ T, e_m′, δ_m′, X_m′), where M ⊎ T is the alphabet, Q_m′ = Q_m, e_m′ = e_m, X_m′ = X_m, and δ_m′ is specified from return and internal transitions, i.e.,

if q −µ(n)/p_m→ q_m ∈ δ_n^ret then δ_m′(p_m, n) = q_m,
if p_m −τ→ q_m ∈ δ_m^int then δ_m′(p_m, τ) = q_m.

The single-exit property ensures that δ_m′ is well-defined for every type m. To show equivalence of accepted languages, let w ∈ (M ⊎ T ⊎ M̄)∗ be a typed event stream. By induction on the length of w it can be shown that dEDTD D accepts exactly the datatyped event streams that lead to an exit state in dXVPA A, i.e., m · w · m̄ ∈ L^type_d(m) ⟺ (e_m, ⊥) −µ(w)→_A (q, ⊥) and q ∈ X_m. The dEDTD accepts L(D) = L_d(m_0) = L(A).

Example 5. Consider the dEDTD defined in Example 4. Figure 5.2 illustrates the dXVPA constructed according to Theorem 1. The states q_0 and q_f have been added to highlight the corresponding VPA semantics. Note that module model is called by modules ad_new and ad_used, and based on the top of the stack, a run returns correctly to the caller.

Figure 5.2: A dXVPA accepting the same documents as the dEDTD in Example 4

The 1PPT property (Subsection 3.2.3 on page 55) characterizes the schemas, where the type for every element is determined when the open-tag is met, and it is therefore a necessary condition of schemas for efficient stream validation [254]. Following Kumar et al. [223], a dEDTD has the 1PPT property if the language-equivalent dXVPA is deterministic.


5.2.3 Character-Data XVPA for Stream Validation

A dXVPA has limitations in efficient stream validation of document event streams because it is designed for datatyped event streams. If a dXVPA is in state p_m in module m and a characters event e is encountered, the automaton can only proceed to state q_m if there exists an internal transition p_m −τ→ q_m ∈ δ_m^int such that lab(e) ∈ φ(τ) for the observed characters event in the document event stream. In the worst case, character data lab(e) needs to be buffered, and MEMBERSHIP needs to be checked for every possible datatype, i.e., O(|lab(e)| · |T|).

For efficient stream validation, this procedure can be improved by unifying the accepted lexical spaces in a datatype choice between two states as a single predicate ψ ∈ Ψ over character data in characters events in document event streams.

Definition 15 (Character-data XVPA (cXVPA)). A cXVPA A over (Σ, M, µ, Ψ) is a tuple A = ({Q_m, e_m, X_m, δ_m}_{m∈M}, m_0, F) and refines the dXVPA definition by utilizing internal transitions δ_m^int : Q_m × Ψ → Q_m over predicates to capture lexical spaces. Furthermore, there can only be a single outgoing internal transition per state in a cXVPA, i.e., if p_m −ψ_i→ q_m and p_m −ψ_j→ q_m′ then q_m = q_m′ and ψ_i = ψ_j.

The semantics of a cXVPA and its accepted language L(A) are characterized by the corresponding VPA over (Σ ⊎ Ψ ⊎ Σ̄), where Ψ represents the predicates over Unicode strings in characters events, and a run on an event stream proceeds along an internal transition p_m −ψ→ q_m ∈ δ_m^int if ψ(lab(e)) holds in state p_m.

Intuitively, a datatype choice at a particular state in a dXVPA is replaced by a single predicate in a cXVPA. If the predicate holds for observed character data when validating a document event stream then there has been at least one matching datatype in the original dXVPA choice, and validation can proceed.

Theorem 2. Every dXVPA can be translated into a cXVPA that accepts the corresponding document event streams.

Proof. This is shown by construction. Suppose A = ({Q_m, e_m, δ_m, X_m}_{m∈M}, m_0, F) is a given dXVPA over (Σ, M, µ, T, φ). Then A′ = ({Q_m, e_m, δ_m′, X_m}_{m∈M}, m_0, F) over (Σ, M, µ, Ψ) is a cXVPA, where internal transitions on datatypes are replaced by predicates on characters events.

States, modules, calls, and returns remain unchanged, and helper functions are necessary for constructing internal predicates. Let P_m = { (p_m, q_m) | ∃τ. p_m −τ→ q_m ∈ δ_m^int } identify state pairs with one or more datatype transitions in between, and function ch(p_m, q_m) = { τ | p_m −τ→ q_m ∈ δ_m^int } returns the datatype choice between a state pair. Transitions δ_m′ = δ_m^call ⊎ δ_m^ret ⊎ δ_m′^int and the set of predicates Ψ are constructed from

ψ_{p_m,q_m}(e) = MEMBERSHIP(lab(e) ∈ ⋃_{τ ∈ ch(p_m,q_m)} φ(τ)),
δ_m′^int = { p_m −ψ_{p_m,q_m}→ q_m | (p_m, q_m) ∈ P_m },
Ψ = { ψ | ∃m. p_m −ψ→ q_m ∈ δ_m′^int }.

Predicate ψ_{p_m,q_m} represents a decision problem (language acceptance) for the allowed lexical spaces between states p_m and q_m. Lexical spaces of lexical datatypes are defined


by regular languages, which are closed under union. Let L(ψ_{p_m,q_m}) be the accepted union of datatype lexical spaces represented by a DFA over Unicode. Then, the worst-case time complexity of stream validating a characters event becomes O(|lab(e)|).

The datatype sequence rule guarantees that a deterministic dXVPA also has a deterministic cXVPA because calls and returns remain unchanged, and successor states after a datatype choice stay the same, i.e., if (p_m, q_m) ∈ P_m and (p_m, q_m′) ∈ P_m then q_m = q_m′. Language equivalence with respect to document event streams between dXVPA A and its corresponding cXVPA A′ follows from congruent states, congruent final states, and congruent call and return transitions. And for every internal transition p −τ→ q in A, there is an internal transition p −ψ→ q in A′ such that φ(τ) ⊆ L(ψ) holds.

For stream validation, the predicate approach in cXVPAs has significant advantages because all necessary computations with respect to datatypes at a certain location in a document are bundled into a single predicate check. In particular, representing L(ψ) as a DFA for the MEMBERSHIP decision in predicate ψ enables the use of high-performance automata libraries in an implementation.

Example 6. Consider the dXVPA defined in Figure 5.2. The corresponding cXVPA has congruent states and call and return transitions; however, internal transitions in modules model and year differ. Instead of e_model −string→ x_model and e_year −gYear, gYearMonth→ x_year, the corresponding cXVPA has transitions e_model −ψ_1→ x_model and e_year −ψ_2→ x_year respectively, where predicate ψ_1(e) = lab(e) ∈ φ(string) and predicate ψ_2(e) = lab(e) ∈ φ(gYear) ∪ φ(gYearMonth).

Corollary. Every dEDTD can be translated into a cXVPA, and the cXVPA is deterministic if the originating schema has the 1PPT property. A deterministic cXVPA validates a document event stream in a single pass and in linear time, and space is bounded by the nesting depth of the document.
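The predicate construction of Theorem 2 amounts to unioning the lexical spaces of a datatype choice. The following is a minimal Python sketch that uses simplified anchored regular expressions instead of precompiled DFAs over Unicode; the regexes only approximate the XSD lexical spaces and are assumptions for illustration.

import re

# Simplified lexical spaces (assumptions; the XSD definitions are more involved).
LEXICAL_SPACE = {
    "gYear": r"-?\d{4,}(Z|[+-]\d{2}:\d{2})?",
    "gYearMonth": r"-?\d{4,}-\d{2}(Z|[+-]\d{2}:\d{2})?",
    "string": r"[\s\S]*",
}

def make_predicate(datatype_choice):
    """Collapse a datatype choice into one predicate psi over character data."""
    union = "|".join(f"(?:{LEXICAL_SPACE[t]})" for t in datatype_choice)
    compiled = re.compile(union)
    return lambda text: compiled.fullmatch(text) is not None

# psi_2 from Example 6: the choice between gYear and gYearMonth becomes one check.
psi_2 = make_predicate({"gYear", "gYearMonth"})
assert psi_2("2016") and psi_2("2016-04") and not psi_2("April")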

5.3 A Lexical Datatype System for Datatype Inference

The previous section shows that a cXVPA for linear-time stream validation can be constructed from a given dEDTD or dXVPA, where datatypes and their respective lexical spaces are given beforehand. A learner observes strings over Unicode in characters events without knowing the language class these strings were generated from. The language class however greatly affects learnability of a language, and as a first generalization step, a lexical datatype system and algorithms for inferring finite sets of datatypes from the infinite domain of Unicode strings are proposed. The lexical datatype system infers a set of satisfying datatypes for a string over Unicode. Two steps are involved: lexical subsumption and a preference heuristic based on datatype semantics.

Definition 16 (Lexical datatype system). A tuple dts = (T, φ, ∼_s, ≤_s) is a lexical datatype system, where T is a set of lexical datatypes, and φ : T → REG(U) assigns lexical spaces to datatypes. Datatypes must be lexically distinct, i.e., φ(τ) = φ(τ′) ⟹ τ = τ′, and φ imposes a partial ordering ≤_lex, i.e., τ ≤_lex τ′ ⟺ φ(τ) ⊆ φ(τ′). T always contains a unique top datatype ⊤ that accepts any Unicode string, i.e., φ(⊤) = U∗. Equivalence relation ∼_s : T → K partitions T into kinds K with respect to datatype semantics, and ≤_s is an ordering on kinds. Moreover, the kinds impose a semantic ordering ≤_s′ on datatypes, i.e., τ ≤_s′ τ′ ⟺ [τ]_{∼_s} ≤_s [τ′]_{∼_s}.


5.3.1 Lexical Subsumption

Figure 5.3 shows the lexical datatypes T based on the primitive and built-in XSD datatypes [463]. Unicode regular expressions in the XSD standard characterize φ, and ≤_lex is computed by the software prototype. Some datatypes are lexically indistinguishable and are therefore not included, i.e.,

double =_lex float,
NCName =_lex ENTITY =_lex ID =_lex IDREF,
NMTOKENS =_lex ENTITIES =_lex IDREFS.

For a location with text content in a document, a learner should infer the “least lexical space” that covers all the observed strings in training examples at the particular location. This “least lexical space” is approximated by a set of required datatypes (i.e., a datatype choice), and these datatypes have to be minimal with respect to ≤_lex. Semantic XML attacks affecting text content, e.g., by placing forbidden characters in CDATA sections, eventually use characters or substrings that are not in the inferred lexical space, approximated by a set of datatypes, and the automaton would reject.

Figure 5.3: Ordering ≤_lex on lexically distinct XSD datatypes

Definition 17 (Minimally required datatypes). A set of minimally required datatypes for a string w over Unicode is the nonempty antichain R ⊆ T of minimal datatypes with respect to ≤_lex such that τ ∈ R ⟹ w ∈ φ(τ), and τ′ <_lex τ for τ ∈ R ⟹ w ∉ φ(τ′).

Algorithm 1: minLex computes minimally required datatypes
Input: lexical datatype system (T, φ, ∼_s, ≤_s), Unicode string w
Output: minimally required datatypes R ⊆ T
1  R := {}
2  candidates := T
3  for τ in topologicalSortOrder(T, ≤_lex) do
4      if candidates = ∅ then done
5      else if τ ∈ candidates and w ∈ φ(τ) then
6          R := R ∪ {τ}
7          candidates := candidates \ {x ∈ T | τ ≤_lex x}

The minimally required datatypes for a string are computed by Algorithm 1 (minLex). The algorithm terminates after |T| steps in the worst case. Containment of a string in respective lexical spaces is checked in topological sort order on T. To minimize the number of checks, a candidates set is maintained. If the string is in the lexical space of some datatype, all greater datatypes are removed from the candidates because the transitivity of ≤_lex guarantees a positive containment in all of them. Furthermore, the topological sort order guarantees that the matched datatypes are minimal with respect to ≤_lex. Together with the removal of candidates, matched datatypes are guaranteed to be incomparable, i.e., an antichain.
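A direct transcription of Algorithm 1 in Python, assuming the lexical spaces are given as compiled regular expressions and the ordering ≤_lex as a precomputed table of strictly greater datatypes; the three-datatype system at the end is hypothetical and much smaller than the lattice of Figure 5.3.

import re

def min_lex(w, lexical_space, greater, topo_order):
    """Algorithm 1 (minLex): minimally required datatypes for string w.

    lexical_space: datatype -> compiled regex for its lexical space
    greater:       datatype -> all datatypes strictly greater w.r.t. <=_lex
    topo_order:    datatypes sorted so that lexically smaller types come first
    """
    required, candidates = set(), set(lexical_space)
    for tau in topo_order:
        if not candidates:
            break
        if tau in candidates and lexical_space[tau].fullmatch(w):
            required.add(tau)
            candidates -= {tau} | greater[tau]  # greater datatypes also match
    return required  # nonempty as long as a top datatype is present

# Hypothetical three-datatype system: boolean <=_lex token <=_lex string (top).
spaces = {"boolean": re.compile(r"true|false|1|0"),
          "token": re.compile(r"\S(.*\S)?|"),
          "string": re.compile(r"[\s\S]*")}
greater = {"boolean": {"token", "string"}, "token": {"string"}, "string": set()}
assert min_lex("false", spaces, greater, ["boolean", "token", "string"]) == {"boolean"}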

Lemma 1. Algorithm 1 (minLex) always returns a nonempty set for the proposed lexical datatype system (Figure 5.3).

Proof. The datatype ⊤ has lexical space U∗ and matches any string, even special characters embedded in a CDATA section, and Algorithm 1 always returns {⊤} if none of the other datatypes match.

This property of the algorithm is important because for learning, the minimally required datatypes at a certain location are modeled as a datatype choice in a dXVPA, and an empty set would introduce unreachable states.

5.3.2 Linear-Time Lexical Subsumption Computation

Algorithm 1 has drawbacks in stream processing: in the worst case, the text content lab(e) of a characters event e needs to be buffered, and MEMBERSHIP needs to be decided |T| times. When precomputed DFAs represent the lexical spaces, the worst-case time complexity becomes O(|T| · |lab(e)|). One strategy to eliminate the need for buffering is to simulate |T| DFAs in parallel on each symbol in lab(e); however, the time complexity remains unchanged on a single-core machine.

An intuitive approach toward O(|lab(e)|) time complexity for fast inference in a streaming setting is precomputing a single DFA A such that L(A) = ⋃_{τ∈T} φ(τ). A refinement of DFAs is however necessary to keep track of datatypes in final states.

Definition 18 (DFA with typed final states (DFA^tf)). Tuple A = (Q, Σ, δ, q_0, fin) is a DFA^tf, where Q, Σ, δ, and q_0 have the usual DFA semantics. The alphabet is Unicode in the setting of a lexical datatype system. Function fin : Q → P(T) defines acceptance by assigning sets of datatypes to final states. Function δ∗ : Q × Σ∗ → Q extends the transition function to strings, and the accepted language is L(A) = { w | fin(δ∗(q_0, w)) ≠ ∅ }. The set of required datatypes for some string w is then lex(w) = fin(δ∗(q_0, w)), and MEMBERSHIP is decided in O(|w|).

The union operation A ∪ A′ for two DFA^tfs having the same alphabet is defined in terms of a product automaton A″ = (Q″, Σ, δ″, q_0″, fin″) over the Cartesian product state space Q″ = Q × Q′ with a new transition relation δ″((q, q′), a) = (δ(q, a), δ′(q′, a)), a new start state q_0″ = (q_0, q_0′), and a new final state function fin″(q, q′) = fin(q) ∪ fin′(q′). The accepted language of this product automaton is L(A″) = L(A) ∪ L(A′).

The DFA^tf approach to subsumption-based datatype inference is precomputing an automaton as explained in Algorithm 2 (precompute). The algorithm subsequently extends a product automaton A from DFA^tfs representing lexical spaces of datatypes. As returned sets of datatypes from lex(w) are not necessarily antichains, after constructing the product automaton, returned sets by function fin are restricted to minimal datatypes. Finding the minimally required datatypes for string w then becomes minLex^tf(w) = lex(w), which is computed in O(|w|).

Algorithm 2: precompute a DFA^tf for linear-time lexical subsumption
Input: lexical datatype system (T, φ, ∼_s, ≤_s)
Output: a DFA^tf A

1  A := ({q_0}, U, {}, q_0, {q_0 ↦ ∅})   // initially, L(A) = ∅
2  for τ in T do
3      A′ := a complete and state-minimal DFA^tf that accepts φ(τ)
4      A := A ∪ A′   // product automaton
5  Let Q and fin refer to the states and final state function in A
6  for q in Q do
7      fin(q) := min_{≤_lex} fin(q)   // minimal antichain at q
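The product construction behind Algorithm 2 (the union of Definition 18) can be sketched as follows for explicitly given automata; translating the XSD regular expressions into complete DFAs and restricting fin to ≤_lex-minimal antichains is omitted here.

from itertools import product

def union_dfa_tf(a, b, alphabet):
    """Union of two DFAs with typed final states (Definition 18) as a product
    automaton. Each automaton is a dict with keys 'start', 'delta' mapping
    (state, symbol) -> state, and 'fin' mapping every state to a (possibly
    empty) set of datatypes; both must be complete over the alphabet."""
    delta, fin = {}, {}
    for p, q in product(a['fin'], b['fin']):
        fin[(p, q)] = a['fin'][p] | b['fin'][q]
        for c in alphabet:
            delta[((p, q), c)] = (a['delta'][(p, c)], b['delta'][(q, c)])
    return {'start': (a['start'], b['start']), 'delta': delta, 'fin': fin}

def lex(dfa, w):
    """Run the DFA^tf on w and return fin(delta*(q0, w)) in O(|w|)."""
    q = dfa['start']
    for c in w:
        q = dfa['delta'][(q, c)]
    return dfa['fin'][q]

In the full precompute, the automaton of line 1 accepts the empty language and is folded with one such DFA^tf per datatype before every fin(q) is restricted to its ≤_lex-minimal antichain.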

Lemma 2. Based on a precomputed DFA^tf (Algorithm 2), minLex^tf(w) always returns a nonempty set of datatypes for the proposed lexical datatype system (Figure 5.3).

Proof. The top datatype ⊤ accepts any string, and the corresponding DFA^tf has a single state which is both start state and final. By definition, the datatype ⊤ is contained in any lexical datatype system. Constructing the union with such an automaton turns every state in the product automaton into a final state such that ⊤ ∈ fin(q) for all states q. Updating function fin only removes ⊤ if there is a smaller datatype available, and fin(q) never becomes empty for any state q.

It should be noted that the precomputation of DFA^tf_T is time- and space-intensive. A lexical space in XSD is represented by a regular expression, and the translation process typically involves Thompson's construction into nondeterministic finite automata (NFAs) and subset construction for determinization. Subset construction leads to exponentially many states in a DFA in the worst case [185]. Furthermore, the product automaton DFA^tf_T has size O(m^|T|), where m is the size of the largest DFA^tf_{τ_i} involved. State minimization in between the union operations could mitigate the exponential blow-up but raises time complexity.


The state explosion problem when matching multiple regular expressions simultaneously is a well-discussed problem in the pattern matching and intrusion detection communities. For a survey on the trade-off between fast precomputation and fast simulation, the reader is referred to Yu et al. [495]. Abdelhamid [3] presents an approach to circumvent the state explosion by exploiting parallelism and representing NFAs as reduced ordered binary decision diagrams, which could be utilized for a DFA^tf implementation.

5.3.3 Preference Heuristic

The lexical subsumption order already suggests that lexical spaces of XSD datatypes are often incomparable and ambiguous. In practice, this leads to ambiguous minimally required datatypes, e.g., minLex(false) = {language, boolean, NCName}. While the datatype choice is correct from a lexical standpoint, some datatypes are semantically more specific and therefore preferred over others. A second step in datatype inference is therefore to remove the least specific datatypes from minimally required datatypes. The proposed heuristic captures type derivations in the XSD standard and datatype semantics in an ordering ≤_s for kinds of datatypes. Figure 5.4 illustrates the ordering, and kinds are defined as

stringLike = {string, normalizedString, token, ENTITY, ID, IDREF, NMTOKEN},
listLike = {ENTITIES, IDREFS, NMTOKENS},
structureLike = {anyURI, NOTATION, QName, language, Name, NCName},
encodingLike = {base64Binary, hexBinary},
temporalLike = {date, gDay, gMonth, gMonthDay, gYear, gYearMonth, duration, dayTimeDuration, yearMonthDuration, dateTime, dateTimeStamp, time},
numericLike = {decimal, integer, nonNegativeInteger, positiveInteger, nonPositiveInteger, negativeInteger},
atomicNumericLike = {float, double, long, int, short, byte},
atomicUnsignedLike = {unsignedLong, unsignedInt, unsignedShort, unsignedByte},
booleanLike = {boolean}.

Note that there is also a distinguished kind {⊤} for the datatype ⊤ for completeness.

Algorithm 3 (pref) explains the heuristic. Function pref compares pairs of minimally required datatypes, and if two datatypes are comparable with respect to ≤_s, the less specific datatype is removed from the original set. The resulting set R′ is still an antichain of datatypes with respect to ≤_lex, and pref clearly terminates because R is finite.

Algorithm 3: pref selects only specific datatypes based on ≤_s
Input: lexical datatype system (T, φ, ∼_s, ≤_s), set of datatypes R ⊆ T
Output: set of datatypes R′ ⊆ T
1  R′ := R
2  for τ, τ′ in R and τ ≠ τ′ do
3      if [τ]_{∼_s} <_s [τ′]_{∼_s} then R′ := R′ \ {τ′}
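A small Python sketch of the pref heuristic; kind_of and kind_lt stand in for ∼_s and the strict kind ordering, where kind_lt(k1, k2) is assumed to mean that k1 is the more specific kind.

def pref(R, kind_of, kind_lt):
    """Algorithm 3 (pref): drop the less specific of two semantically comparable
    datatypes. kind_of maps a datatype to its kind; kind_lt(k1, k2) decides
    whether kind k1 is strictly more specific than kind k2."""
    result = set(R)
    for tau in R:
        for tau_prime in R:
            if tau != tau_prime and kind_lt(kind_of(tau), kind_of(tau_prime)):
                result.discard(tau_prime)   # tau_prime is less specific
    return result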


Figure 5.4: Ordering ≤_s on kinds of lexical datatypes

5.3.4 Datatyped Event Stream for Learning

Suppose a document is an event stream over startElement, endElement, and characters events, where events are labeled by qualified names and Unicode text respectively. For learning, every text is mapped to its set of minimally required datatypes. In particular, the following functions are utilized:

minReq(w) = pref(minLex(w)) for some string w, and    (5.1)

datatyped(e) = minReq(lab(e)) if e is a characters event, and datatyped(e) = e otherwise.    (5.2)

An example is illustrated in Figure 5.5. It should be noted that the proposed lexical datatype system is not strictly bound to XSD datatypes and is extensible. Arbitrary lexical datatypes can be added as long as they have unique names and distinct lexical spaces. Only a top datatype ⊤ with respect to ≤_lex is required by definition to ensure a nonempty set of minimally required datatypes.

Figure 5.5: Symbolic datatyped event stream; the character data “10.0” and “some text content” in the document event stream are mapped to the datatype sets {decimal} and {token}.
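Equations 5.1 and 5.2 simply compose the two inference steps per characters event. The following minimal sketch reuses the min_lex, pref, and event-class sketches from above; the dts object bundling the datatype-system parameters is hypothetical.

def min_req(w, dts):
    """Equation 5.1: preference heuristic over the minimally required datatypes."""
    return pref(min_lex(w, dts.lexical_space, dts.greater, dts.topo_order),
                dts.kind_of, dts.kind_lt)

def datatyped(e, dts):
    """Equation 5.2: characters events map to a datatype choice, other events pass through."""
    return min_req(e.value, dts) if isinstance(e, Characters) else e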

5.3.5 Combining Datatypes and Incremental Update

So far, only the transformation of a single string to its minimally required datatypes has been covered. Learning also needs a way of aggregating datatypes from different text contents at similar locations in different documents. Let v and w be two strings over Unicode. The minimally required datatypes that accept both strings are then


minReq(v, w) = max_{≤_lex} (minReq(v) ∪ minReq(w)).    (5.3)

The minimally required datatypes of two strings v, w are the union of their individual datatypes, and Equation 5.3 can easily be extended to sets of strings. The resulting set is not necessarily an antichain, and the max_{≤_lex} operation therefore returns the nonempty antichain of maximal datatypes with respect to ≤_lex that cover both strings. It should be noted that the pref heuristic is string centric, i.e., the heuristic must not be applied on unions of sets of datatypes because evidence which has been gained from a particular string could get lost.

Example 7. Consider the strings S = {1, 0, true, 33}. The intuitive datatype choice (boolean + integer) explains S, and the proposed algorithms should find a suitable datatype choice in terms of minimally required datatypes. To illustrate lexical ambiguity, minLex(1) = {boolean, byte, positiveInteger, unsignedByte} is a good example. While these datatypes are lexically correct, there is semantic overlapping, and the pref heuristic reduces ambiguity by keeping the most specific datatypes, i.e.:

minReq(1) = {boolean},
minReq(0) = {boolean},
minReq(true) = {boolean},
minReq(33) = {unsignedByte}.

Computing the antichain of maximal datatypes for covering the union of all strings is then minReq(S) = {boolean, unsignedByte}. Compared to the intuitive choice, the inferred choice is consistent, and the numeric datatype is more constrained.

Incremental Updates

The proposed learning setting in the next section is incremental. Let R be the minimally required datatypes learned so far, and a new string w is encountered. The updated minimally required datatypes are computed as follows:

minReqInc(R, w) = max_{≤_lex} (R ∪ minReq(w)).    (5.4)

Initially, R = ∅, and with every w the lexical coverage either remains unchanged or increases. Note that the preference heuristic is applied individually on every w, new datatypes are added, and a maximal antichain is returned. This operation is required for incrementally updating an intermediate VPA representation, where minimally required datatypes are represented as internal transitions between two states.
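Equations 5.3 and 5.4 reduce to a set union followed by taking the ≤_lex-maximal antichain. A small sketch, where leq_lex(t, u) is assumed to decide t ≤_lex u and min_req_fn is the per-string inference of Equation 5.1:

def max_antichain(datatypes, leq_lex):
    """Keep only the <=_lex-maximal datatypes of a set."""
    return {t for t in datatypes
            if not any(t != u and leq_lex(t, u) for u in datatypes)}

def min_req_union(v, w, min_req_fn, leq_lex):
    """Equation 5.3: datatypes covering both strings v and w."""
    return max_antichain(min_req_fn(v) | min_req_fn(w), leq_lex)

def min_req_inc(R, w, min_req_fn, leq_lex):
    """Equation 5.4: fold a newly observed string w into the datatypes R learned so far."""
    return max_antichain(set(R) | min_req_fn(w), leq_lex)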

5.4 Learning from Positive Examples

The assumed learning setting is Gold's learning in the limit from positive examples [155] (Definition 8 on page 82), where a learner receives examples (i.e., document event streams) and generates respective hypotheses (i.e., dXVPAs). For incremental learning, the learner maintains an intermediate VPA representation to capture what has been learned so far, and a newly encountered example updates the intermediate representation. A resulting dXVPA is derived from the intermediate VPA.


Theorem 3 (Gold [155]). The language class of unrestricted regular expressions is not learnable in the limit from positive examples only.

This result immediately prohibits grammatical inference from example XML docu- ments in the unrestricted case if the originating schema has an iterating term in some production rule. A learning algorithm in the proposed setting therefore needs to assume a learnable language class that covers a large portion of practical schemas. Such an algo- rithm automatically becomes a heuristic that requires experimental evaluation. Indeed, practical studies have been undertaken to understand the complexity of practical schemas, and two aspects are relevant for the proposed learning approach:

• Simplicity of regular expressions in production rules. With respect to the complex- ity of regular expressions, Bex et al. [50] have examined 202 DTDs and XSDs and conclude that the majority of regular expressions in production rules are simple, and alphabet symbols (i.e., elements and types) occur only a small number of times in each rule.

• Locality of typing contexts. Martens et al. [254] have studied 819 DTDs and XSDs from the web and XML standards, and it turns out that typing elements in 98% of the schemas is local, i.e., the type of an element only depends on its parent.

The proposed learning approach exploits the locality in simple regular expressions for production rules and typing in local contexts, and a state merging approach, motivated by the k-testable regular languages, is presented.

5.4.1 The k-Testable Regular Languages for Element Locality

With respect to simple regular expressions in production rules, Bex et al. [48, 49] characterize SOREs as regular expressions where an alphabet symbol occurs exactly once, and i-OREs are a generalization of SOREs, where an alphabet symbol occurs at most i times. SOREs are therefore 1-OREs, and as an example, ab∗c is a SORE, and ab∗a is a 2-ORE. The authors claim that the vast majority of regular expressions in schemas are SOREs, and every SORE is a 2-testable regular language which can be efficiently learned [49].

Definition 19 (k-testable regular language [144, 178]). A regular language is k-testable in a strict sense for k > 0 if it can be fully characterized by four sets (I, F, T, C) of strings over alphabet Σ, where I, F ⊆ Σ^{k−1} are initial prefixes and final suffixes, and the remaining two sets constrain the allowed segments of length k and the short strings of the language.

A k-testable regular language can be inferred from positive examples by constructing a prefix tree acceptor (PTA), where every state is named after the example prefix it represents, and then merging the

states that share the same (k−1)-length name suffix. Note that a PTA is deterministic by construction, and state merging toward a k-testable DFA preserves the determinism.

Example 8. The set of examples {ac, abbcd, abc, acddd, abbbcd} is characteristic for the SORE ab∗cd∗, and a corresponding 2-testable DFA can be correctly inferred. The automata are illustrated in Figure 5.6.

Figure 5.6: PTA and corresponding 2-testable DFA; (a) PTA with named states, (b) 2-testable DFA
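The inference of Example 8 can be reproduced in a few lines. The Python sketch below names every PTA state directly by the (k−1)-length suffix of its prefix, which is equivalent to first building the PTA and then merging states with equal name suffixes; the helper names are hypothetical.

def k_testable_dfa(examples, k=2):
    """Infer a k-testable DFA: name each PTA state by the (k-1)-length suffix
    of the prefix it represents, which merges the PTA states on the fly."""
    def state(prefix):
        return prefix[-(k - 1):] if k > 1 else ""
    start, finals, delta = state(""), set(), {}
    for w in examples:
        for i, a in enumerate(w):
            delta[(state(w[:i]), a)] = state(w[:i + 1])
        finals.add(state(w))
    return start, finals, delta

def accepts(dfa, w):
    start, finals, delta = dfa
    q = start
    for a in w:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals

# The examples of Example 8 yield a DFA for ab*cd*.
dfa = k_testable_dfa({"ac", "abbcd", "abc", "acddd", "abbbcd"}, k=2)
assert accepts(dfa, "abbbbc") and accepts(dfa, "acdd") and not accepts(dfa, "ba")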

The k-testable languages, including all finite languages, can be efficiently learned in the sketched state merging approach if k is sufficiently large. However, expressiveness is restricted, and there are regular languages that are not k-testable, e.g., the regular language aΣ∗a + bΣ∗b is not k-testable [178, p. 219], and there are also i-OREs with i > 1, like a∗ba∗, that are not k-testable for any k [49].

State merging is an established method for generalization in grammatical inference of regular languages, e.g., k-reversible learning [15, 178], RPNI (Regular Positive and Negative Inference) [315], and EDSM (evidence-driven state merging) [230]. The referenced methods also start by constructing a PTA, and for generalizing the language, states in the PTA are merged. While k-reversibility is a determinism property for guiding the generalization, RPNI and EDSM assume a different learning setting, where negative examples constrain generalization. Learning k-testable regular languages has its shortcomings with respect to expressiveness; however, k ≥ 2 covers a vast majority of the expressiveness encountered in schema regular expressions according to Bex et al. [48]. A learner for k-testable regular languages has useful properties, and state merging in a similar manner is therefore proposed.

5.4.2 Schema Typing Mechanisms for Type Contexts

Studies on the expressiveness of schema languages by Murata et al. [280] and Martens et al. [254] have investigated the necessary conditions for deterministic typing and discuss how syntactic constraints, e.g., in DTD and XSD, affect deterministic typing. These typing mechanisms are exploited in the proposed learner for naming states from prefixes of event streams, so the generalization by state merging respects typing contexts.

Intuitively, a schema has the 1PPT property for typing the event stream of a document when the type of every startElement event is determined from the already observed prefix of the event stream. Deterministic regular expressions in DTD and XSD's EDC and UPA restrictions guarantee this 1PPT property, and in the proposed learning approach, typing information is embedded in named states.


Figure 5.7 illustrates typing mechanisms in schema languages. An XML document is represented as tree t, a node is denoted as v ∈ t, and the qualified element name of node v is lab(v). The 1PPT property basically states that the subtree of nodes traversed up to node v in document order (preceding(v) in Figure 5.7c) determines the type of node v. The preceding corresponds exactly to the prefix of a startElement event in the event stream representation. The expressiveness of unrestricted EDTDs even allows that node types are determined by their children, denoted as preceding-subtree(v) in Figure 5.7d, which corresponds to a delayed typing decision until a respective endElement event is encountered. Note that preceding-subtree-based types can imply nondeterminism. In DTDs, elements cannot have context-dependent meaning because types equal elements, and the type of node v is fully determined by the element name lab(v). This coincides with typing a startElement event in an event stream by its element name only.

Figure 5.7: Typing of an element; (a) anc-str(v), (b) anc-lsib-str(v), (c) preceding(v), (d) preceding-subtree(v)

Ancestor-Based Typing

The expressiveness of XSD is captured by the single-type EDTDs, where the typing mechanism is ancestor-based [280, 254]. In other words, the syntactic EDC and UPA restrictions guarantee that the type of node v is fully determined by its ancestor string anc-str(v) = lab(i_1) · lab(i_2) ··· lab(i_n) such that i_1 is the root node, i_n = v, and i_{j+1} is a child of i_j. The ancestor string is shown in Figure 5.7a.

Definition 20 (Ancestor string for event stream prefixes). For a document event stream prefix w, the ancestor string anc-str(w) = as(w,ε) corresponds exactly to the string of unmatched startElement events in a prefix and is recursively defined as

as(ε, u) = u if the event stream is empty,    (5.5)
as(ew, u) = as(w, u · lab(e)) if e is a startElement event,    (5.6)
as(ew, ua) = as(w, u) if e is an endElement event and lab(e) = a (matching return),    (5.7)
as(ew, u) = as(w, u) else.    (5.8)

Function as clearly terminates for finite-length event stream prefixes because Equations 5.6–5.8 guarantee progress for a nonempty stream and Equation 5.5 terminates when the stream has been consumed. The second argument of the function acts as an accumulator. In case of a startElement event in Equation 5.6, the element name is concatenated to the accumulator and again removed by a matching endElement event in Equation 5.7. The characters events are skipped by leaving the accumulator unchanged in Equation 5.8.
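An iterative rendering of Definition 20, reusing the event classes sketched earlier; names are concatenated without separators, which suffices for the single-letter examples below.

def anc_str(prefix):
    """Ancestor string of a document event stream prefix (Definition 20):
    the names of the unmatched startElement events, computed with an
    explicit accumulator instead of the recursive function as."""
    u = []
    for e in prefix:
        if isinstance(e, StartElement):
            u.append(e.name)                               # Equation 5.6
        elif isinstance(e, EndElement) and u and u[-1] == e.name:
            u.pop()                                        # Equation 5.7
        # characters events leave the accumulator unchanged (Equation 5.8)
    return "".join(u)

# The prefix a a [10.0] /a b of <a><a>10.0</a><b>... has ancestor string "ab":
prefix = [StartElement("a"), StartElement("a"), Characters("10.0"),
          EndElement("a"), StartElement("b")]
assert anc_str(prefix) == "ab"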


Ancestor-Sibling-Based Typing

A learner is not aware of the expressiveness beforehand, and ad hoc XML or documents generated from a Relax NG grammar could exceed the expressiveness of ancestor-based typing. In a streaming setting, the preceding represents the largest available context of an element for a deterministic typing decision. A surprising result by Martens et al. [254] proves that typing based on ancestors and their siblings, i.e., the ancestor-sibling string anc-lsib-str(v), is sufficient for the 1PPT property, or in other words, an EDTD with ancestor-sibling-based types also has preceding-based types.

Let u_1, ..., u_m be the left-sibling nodes of some node v; the left-sibling string is then lsib(v) = lab(u_1) ··· lab(u_m) · lab(v). The ancestor-sibling string of node v is anc-lsib-str(v) = lsib(i_1)#lsib(i_2)# ··· #lsib(i_n) such that i_1 is the root node, i_n = v, and i_{j+1} is a child of i_j. The ancestor-sibling string is illustrated in Figure 5.7b.

Definition 21 (Ancestor-sibling string for event stream prefixes). For a prefix of a docu- ment event stream w, the ancestor-sibling string is defined in terms of a recursive function anc-lsib-str(w) = als(w,ε,ε), where

als(ε, u, v) = u#v if the event stream is empty,    (5.9)
als(ew, u, v) = als(w, u#v · lab(e), ε) if e is a startElement event,    (5.10)
als(ew, u#u′a, v) = als(w, u, u′a) if e is an endElement event and lab(e) = a (matching return),    (5.11)
als(ew, u, v) = als(w, u, v) else.    (5.12)

Function als terminates for finite-length event stream prefixes. Equations 5.10–5.12 consume a nonempty stream, and Equation 5.9 returns an ancestor-sibling string when the end of the stream has been reached. The second and third argument of the function are accumulators for ancestor-sibling and left-sibling strings respectively. In case of a startElement event in Equation 5.10, the current left siblings are concatenated with the element name, the result is appended to the ancestor-sibling accumulator, and siblings are set to the empty string. An endElement event in Equation 5.11 splits the previously appended result from the ancestor-sibling accumulator and restores the sibling accumulator. Equation 5.12 guarantees that characters events leave the accumulators unchanged.
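Definition 21 can likewise be rendered iteratively. The sketch below keeps the two accumulators u and v separately, which is exactly the (typing context, left siblings) pair used for state naming in the next subsection; it again reuses the event classes from the earlier sketch.

def anc_lsib_state(prefix):
    """Compute the accumulators (u, v) of Definition 21 for an event stream
    prefix: u as a list of completed left-sibling strings (joined by '#'),
    v as the left-sibling string of the current context."""
    u, v = [], ""
    for e in prefix:
        if isinstance(e, StartElement):      # Equation 5.10: u := u#v.lab(e), v := eps
            u.append(v + e.name)
            v = ""
        elif isinstance(e, EndElement):      # Equation 5.11: split u'a back off u
            v = u.pop()
        # characters events leave u and v unchanged (Equation 5.12)
    return "#".join(u), v

# Prefix a a [10.0] /a b of <a><a>10.0</a><b>... yields the pair (a#ab, eps):
prefix = [StartElement("a"), StartElement("a"), Characters("10.0"),
          EndElement("a"), StartElement("b")]
assert anc_lsib_state(prefix) == ("a#ab", "")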

5.4.3 Visibly Pushdown Prefix Acceptor

A VPPA is similar to a PTA: every prefix of example event streams generates a unique state, and a VPPA accepts the examples seen so far. States in a VPPA are therefore named according to the following definition.

Definition 22 (VPA state naming). A state gets a name (u, v), where u is a typing context and v is an extended left-sibling string, i.e., a prefix within the typing context. The special symbols # and $ represent the left-sibling separator and a placeholder for character data. A typing mechanism needs to be chosen for naming VPA states:

• Ancestor-based states. A state (u, v) ∈ (Σ∗ × (Σ ∪ {$})∗) is a pair of ancestor string and left-sibling string.

• Ancestor-sibling-based states. A state (u, v) ∈ ((Σ ∪ {$, #})∗ × (Σ ∪ {$})∗) is a pair of ancestor-sibling string and left-sibling string.


For learning within the XSD language class, states need to be ancestor based, and for learning beyond XSD expressiveness but within the larger 1PPT language class, states must be ancestor-sibling based.

Incremental VPPA Updates

Algorithm 4 (incVPA) specifies the incremental update of an intermediate VPA for the VPPA construction based on the proposed state naming schemes. The algorithm initially starts with A = ({(ε,ε)}, (ε,ε), ∅, {(ε,ε)}, ∅), which has a single nonaccepting start state, and elements are not defined yet, i.e., Σ = ∅. Furthermore, the stack alphabet is implicitly the set of states in the proposed algorithm. A start state (ε,ε) indicates that no typing contexts and no left siblings have been observed yet.

Algorithm 4: incVPA for incrementally updating an intermediate VPA
Input: intermediate VPA A = (Q, q_0, F, Q, δ) over Σ ⊎ T ⊎ Σ̄
       lexical datatype system (T, φ, ∼_s, ≤_s)
       functions callName, intName, and retName for state naming
       well-matched document event stream w
Output: updated VPA A

1  s := ⊥   // empty stack
2  q := q_0   // current state
3  for e in w do
4      switch eventType(e) do
5          case startElement
6              Σ := Σ ∪ {lab(e)}   // update element names
7              q′ := callName(q, lab(e))
8              δ^call := δ^call ∪ { q −lab(e)/q→ q′ }
9              s := s · q
10             q := q′

11         case endElement
12             let s = vp   // p is top of the stack
13             q′ := retName(q, p, lab(e))
14             δ^ret := δ^ret ∪ { q −lab(e)/p→ q′ }
15             s := v
16             q := q′

17         case characters
18             q′ := intName(q)
19             let R = { τ | q −τ→ q′ ∈ δ^int }
20             δ^int := δ^int \ { q −τ→ q′ | τ ∈ R } ∪ { q −τ→ q′ | τ ∈ minReqInc(R, lab(e)) }
21             q := q′

22  F := F ∪ {q}

The algorithm iterates over document event stream w in a single pass and terminates for a finite-length w. New element names are added to Σ, and new states and transitions

are added to the intermediate VPA. The algorithm maintains a current state q and a stack s to move along transitions. To keep the algorithm abstract, three functions that generate the next state name must be provided for a chosen naming scheme:

• callName : Q × Σ → Q

• retName : Q × Q × Σ → Q

• intName : Q → Q

The intermediate VPA representation always has start state (ε,ε), and new states from prefixes of document event streams are generated inductively. Function intName is the same for both naming schemes and defined as

intName((u, v)) = (u, v · $).    (5.13)

The typing context of the new state stays the same, while placeholder $ is appended to the left siblings for denoting the successor state of a datatype choice from some named state q to intName(q). Function intName implements Equations 5.8 and 5.12 for both ancestor- and ancestor-sibling-based typing mechanisms, where characters events leave the typing context untouched. The placeholder approach is needed because datatype transitions evolve over time by incremental updates, and successor states after a datatype choice must satisfy the datatype sequence rule. A placeholder without embedded datatype information guarantees that datatypes in mixed-content XML do not intervene with typing. In lines 18–20, the algorithm gathers existing datatype transitions between current state q and next state intName(q), removes these transitions from δ^int, and adds new transitions for the minimally required datatypes computed by incremental function minReqInc from existing and newly observed datatypes. The updated internal transitions between q and intName(q) are therefore a valid datatype choice.

Functions for Ancestor-Based Typing

The naming scheme for ancestor-based typing of a state (u, v) requires u to be an ancestor string and v to be an extended left-sibling string within the context of u. The callName and retName functions are therefore defined as

callName^as((u, v), e) = (u · lab(e), ε),    (5.14)
retName^as((ua, v), p, e) = (u, π_2(p) · lab(e)).    (5.15)

The function callName^as has two arguments: the current named state (u, v) and a startElement event e, and it implements Equation 5.6 for ancestor-based typing. An event's element name is lifted into the next typing context, and left siblings are set to ε because there are no siblings in the new context yet. Similarly, retName^as implements Equation 5.7 for generating typing contexts. The function has three arguments: the current named state (ua, v), the saved state p from the top of the stack, and endElement event e. The element name lab(e) = a is already part of the typing context ua in the current state, and in the generated next state, the typing context returns to u. The state's second component is derived from the left siblings in state p by appending lab(e) as a new left sibling.
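Equations 5.13–5.15 can be sketched in Python with ancestor and left-sibling strings represented as tuples of element names (a representation choice, not prescribed by the thesis); the assertions replay the prefix a a of Figure 5.8a.

def int_name(state):                            # Equation 5.13: append $ to the siblings
    u, v = state
    return (u, v + ("$",))

def call_name_as(state, element):               # Equation 5.14
    u, _ = state
    return (u + (element,), ())

def ret_name_as(state, stack_state, element):   # Equation 5.15
    u, _ = state
    return (u[:-1], stack_state[1] + (element,))

# (eps,eps) -a-> ((a,),()) -a-> ((a,a),()), the characters event appends $,
# and the matching endElement returns to context (a,) with left sibling a:
q = call_name_as(call_name_as(((), ()), "a"), "a")
assert ret_name_as(int_name(q), (("a",), ()), "a") == (("a",), ("a",))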


Example 9. An example for Algorithm 4 using the functions for ancestor-based typing is illustrated in Figure 5.8a. Every prefix of the event stream has an associated named state as shown in Figure 5.8b. Choosing ancestor-based typing can already lead to generalization which manifests as a loop in the resulting intermediate VPA in Figure 5.8b. The two b elements in context of the same a element lead to the identical state (ab, ε) according to callName^as, and the consequence is a loop.

Functions for Ancestor-Sibling-Based Typing

Alternatively, for ancestor-sibling-based typing, the callName and retName functions are defined as

callName^als((u, v), e) = (u#v · lab(e), ε),    (5.16)
retName^als((u#u′a, v), p, e) = (u, u′a).    (5.17)

Function callName^als implements the typing mechanism according to Equation 5.10 and has two arguments: the current state (u, v) and a startElement event e. The generated next state lifts the current siblings and the event's element name into the typing context, and left siblings are set to ε. Function retName^als implements Equation 5.11 and has three arguments: the current state (u#u′a, v), where lab(e) = a is already part of the typing context, the saved state p from the top of the stack, and endElement event e. The generated next state gets the previous typing context u and has updated siblings u′a.

Example 10. The ancestor-sibling-based state naming scheme for prefixes of an event stream example is illustrated in Figure 5.8c, and the resulting VPA is shown in Figure 5.8d. It should be noted that choosing ancestor-sibling-based typing extends the learnable language class beyond XSD because two identical element names having different types in the same content model can be learned, which would violate the EDC constraint.

Figure 5.8: VPPA construction examples; (a) Ancestor-based states, (b) Ancestor-based VPPA, (c) Ancestor-sibling-based states, (d) Ancestor-sibling-based VPPA


5.4.4 State Merging for Generalization

Learning is achieved by merging states in the incrementally updated VPA according to their locality. The locality originates from simple production rules and local typing contexts in the majority of practical schemas. For each typing mechanism, this locality is modeled in a parameterized distinguishing function. The notion of distinguishing function has been introduced by Fernau [125] in the domain of regular languages.

Definition 23 (Distinguishing function [125]). Let Σ be an alphabet, and X is some finite set. Mapping f : Σ∗ → X is a distinguishing function if f(w) = f(v) ⟹ f(wu) = f(vu) for all u, v, w ∈ Σ∗.

Prefixes of event streams are already embedded in state names (this follows from Equations 5.14–5.17), and the concept of a distinguishing function is applied by defining two functions f^as_{k,l} and f^als_{k,l} that partition the named VPA states into equivalence classes. With respect to the parameters, k restricts the locality of a state to the nearest k left siblings, and parameter l restricts the size of the typing context to the nearest l ancestors. Let σ_j(w) = w′ be the suffix function for strings that returns w′ of length j with w = uw′ if |w| ≥ j, or it returns w if |w| < j. The distinguishing functions are

f^as_{k,l}((u, v)) = (σ_l(u), σ_k(v)),    (5.18)
f^als_{k,l}((r_1# ··· #r_n, v)) = (σ_k(r_{n−l+1})# ··· #σ_k(r_n), σ_k(v)).    (5.19)

In both equations, the left-sibling string is restricted to its k-length suffix. Ancestor-based state names are generalized by restricting the ancestor string to its l-length suffix. In Equation 5.19, the concatenation r_1# ··· #r_n is an ancestor-sibling string, where left-sibling strings r_i are constrained to their k-length suffixes.

Theorem 4. Functions f^as_{k,l} and f^als_{k,l} are distinguishing.

Proof. For the case of f^as_{k,l}, suppose there are two named states (u_0, v_0) and (r_0, s_0) representing event stream prefixes w and w′ respectively such that f^as_{k,l}((u_0, v_0)) = f^as_{k,l}((r_0, s_0)). This implies that u_0, r_0 and v_0, s_0 share suffixes for some k and l. The distinguishing condition requires that extending w and w′ by some event stream w″ ∈ (Σ ⊎ T ⊎ Σ̄)∗ of length |w″| = n generates two named states (u_n, v_n) and (r_n, s_n) such that f^as_{k,l}((u_n, v_n)) = f^as_{k,l}((r_n, s_n)) also holds.

If |w″| = 0 then (u_0, v_0) and (r_0, s_0) remain unchanged, and for the case that |w″| > 0, the left-most event always generates two new states (u_{i+1}, v_{i+1}) and (r_{i+1}, s_{i+1}) according to the incremental update equations such that u_{i+1} = u_i · a, r_{i+1} = r_i · a, v_{i+1} = v_i · b, s_{i+1} = s_i · b with a ∈ (Σ ∪ {ε}) or b ∈ (Σ ∪ {$, ε}). Shared suffixes in ancestor strings u_i, r_i and left-sibling strings v_i, s_i are therefore extended simultaneously, suffixes remain equivalent, and f^as_{k,l}((u_i, v_i)) = f^as_{k,l}((r_i, s_i)) holds.

The distinguishing condition of f^als_{k,l} can be shown similarly. The incremental update rules in Equations 5.16 and 5.17 guarantee that shared suffixes in named states (u_0, v_0) and (r_0, s_0) are equivalently extended such that f^als_{k,l}((u_i, v_i)) = f^als_{k,l}((r_i, s_i)) holds.

Algorithm 5 (mergeStates) specifies the state-merging procedure. As inputs, the algorithm expects an intermediate VPA constructed by incVPA, the lexical datatype system of the learner, and, depending on the chosen state naming scheme, either f^as_{k,l} or f^als_{k,l} with fixed parameters k and l. A generalized VPA is returned. The states and transitions are replaced by their respective projections, and the automaton stays deterministic. For the

case that datatype choices are merged into one, a new antichain of datatypes is computed for internal transitions. As all the sets involved in Algorithm 5 are necessarily finite, the algorithm always terminates.

Algorithm 5: mergeStates for generalizing the intermediate VPA
Input: intermediate VPA A = (Q, q_0, F, Q, δ) over Σ ⊎ T ⊎ Σ̄
       lexical datatype system (T, φ, ∼_s, ≤_s)
       distinguishing function f_{k,l} for some state naming scheme
Output: generalized VPA A′ = (Q′, q_0′, F′, Q′, δ′) over Σ ⊎ T ⊎ Σ̄
1  Q′ := { f_{k,l}(q) | q ∈ Q }
2  q_0′ := f_{k,l}(q_0)
3  F′ := { f_{k,l}(q) | q ∈ F }
4  δ′^call := { f_{k,l}(q) −c/f_{k,l}(p)→ f_{k,l}(q′) | q −c/p→ q′ ∈ δ^call }
5  δ′^ret := { f_{k,l}(q) −c/f_{k,l}(p)→ f_{k,l}(q′) | q −c/p→ q′ ∈ δ^ret }
6  δ′^int := { f_{k,l}(q) −τ→ f_{k,l}(q′) | q −τ→ q′ ∈ δ^int }
7  δ″ := ∅   // for updated minimally required datatypes
8  foreach (q, q′) ∈ { (q, q′) | ∃τ. q −τ→ q′ ∈ δ′^int } do
9      let R = { τ | q −τ→ q′ ∈ δ′^int }
10     δ″ := δ″ ∪ { q −τ→ q′ | τ ∈ max_{≤_lex} R }
11 δ′ := δ′^call ⊎ δ″ ⊎ δ′^ret

Example 11. Consider the VPPA constructed in Figure 5.8d using an ancestor-sibling-based naming scheme. Algorithm 5 using f^als_{k,l} as a distinguishing function with parameters k = 1 and l = 2 returns the VPA in Figure 5.9. Intuitively, k = 1 treats the language within a typing context as a 2-testable regular language because only the nearest sibling is considered for computing the next state, and the nearest two ancestors define the typing context. In the resulting VPA, two merge operations have been performed, i.e., f^als_{1,2}((a, ab)) = f^als_{1,2}((a, abb)) = (a, b) and f^als_{1,2}((a#ab, ε)) = f^als_{1,2}((a#abb, ε)) = (a#b, ε).

Figure 5.9: State merging with parameters k = 1, l = 2 for the example in Figure 5.8d


Refinement of State Naming Functions for Implicit State Merging

For better runtime complexity, the functionality of the mergeStates algorithm can be made implicit in the incVPA algorithm by refining the state naming functions as

intName_{k,l}((u, v)) = (u, σ_k(v · $)),    (5.20)
retName_{k,l}(q, p, e) = (π_1(p), σ_k(π_2(p) · lab(e))),    (5.21)
callName^as_{k,l}((u, v), e) = (σ_l(u · lab(e)), ε),    (5.22)
callName^als_{k,l}((r_1# ··· #r_n, v), e) = (r_{n−l+1}# ··· #r_n#σ_k(v · lab(e)), ε).    (5.23)

Functions intName_{k,l} and retName_{k,l} are the same for both naming schemes. The typing context and left siblings from the stack state are reused in retName_{k,l}. For Equation 5.22, the suffix function only needs to be applied to the most-recently appended left-sibling string in the ancestor-sibling string. The well-nested property of the event stream guarantees that ancestor-sibling strings are recursively constructed, |r_i| ≤ k, and the number of left-sibling strings n ≤ l. Furthermore, in Equation 5.23, no suffix function is necessary for left-sibling strings in the typing context because |r_n| ≤ k follows from construction.

Henceforth, in the remainder of this thesis, the refined functions for state naming are used directly in incVPA for implicit state merging instead of calling Algorithm 5 (mergeStates). With proper data structures, time complexities are constant for the naming functions.
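The refined functions for the ancestor-based scheme (Equations 5.20–5.22) only add the suffix restriction to the earlier sketches; the ancestor-sibling variant of Equation 5.23 follows the same pattern. A sketch, again with strings represented as tuples of names:

def suffix(seq, j):
    """sigma_j: the j-length suffix of a sequence (the whole sequence if shorter)."""
    return seq[-j:] if len(seq) > j else seq

def make_naming(k, l):
    """Refined ancestor-based naming functions for implicit state merging."""
    def int_name(state):                           # Equation 5.20
        u, v = state
        return (u, suffix(v + ("$",), k))
    def call_name(state, element):                 # Equation 5.22
        u, _ = state
        return (suffix(u + (element,), l), ())
    def ret_name(state, stack_state, element):     # Equation 5.21
        return (stack_state[0], suffix(stack_state[1] + (element,), k))
    return int_name, call_name, ret_name

int_name, call_name, ret_name = make_naming(k=1, l=2)
assert call_name((("a", "b"), ()), "c") == (("b", "c"), ())   # sigma_2 on ancestors
assert int_name((("a",), ("b",))) == (("a",), ("$",))         # sigma_1 on siblings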

5.4.5 Generating dXVPAs

Algorithm 6: genXVPA for exporting a dXVPA from a state-merged VPA
Input: state-merged VPA A = (Q, q_0, F, Q, δ) over Σ ⊎ T ⊎ Σ̄
       lexical datatype system (T, φ, ∼_s, ≤_s)
Output: dXVPA A′ = ({Q_m, e_m, X_m, δ_m}_{m∈M}, m_0, X_{m_0}) over (Σ, M, µ, T, φ)
1  M := { u | (u, v) ∈ Q and u ≠ ε }
2  m_0 := u such that q_0 −c/q_0→ (u, v) ∈ δ^call
3  for m ∈ M do
4      Q_m := { (u, v) | (u, v) ∈ Q and u = m }
5      e_m := (m, ε)   // from Equations 5.14 and 5.16
6      X_m := { q ∈ Q_m | q −c/p→ q′ ∈ δ^ret }
7      δ_m^call := { q −c/q→ q′ ∈ δ^call | q ∈ Q_m }
8      δ_m^int := { q −τ→ q′ ∈ δ^int | q, q′ ∈ Q_m }
9      δ_m^ret := { q −c/p→ q′ ∈ δ^ret | q ∈ Q_m }
10     δ_m^ret := δ_m^ret ∪ { q −c/p→ q′ | q ∈ X_m and ∃q_m. q_m −c/p→ q′ ∈ δ_m^ret }
11     δ_m := δ_m^call ⊎ δ_m^int ⊎ δ_m^ret
12     if ∃q. q −c/q→ e_m ∈ δ^call then µ(m) := c
13 A′ := minimize(A′)


For a chosen state naming scheme, i.e., ancestor or ancestor-sibling based, and the refined naming functions for implicit state merging, Algorithm 4 (incVPA) provides an incremental learner that maintains an internal VPA. A valid dXVPA representation of the intermediate VPA is then generated by Algorithm 6 (genXVPA).

The genXVPA algorithm partitions the named states into modules according to their typing contexts. The initial module m_0 is derived from the unique state called by (ε,ε). Schema languages require a single type for the document root, and when all learned examples have the same root element name, (ε,ε) has a single successor state whose typing context becomes m_0. For every module, the set of states and exit states are defined, and the callName function guarantees that the entry of module m is always state (m, ε). Module transitions are the VPA transitions that originate from module states. In line 10, additional return transitions are added to all module exit states such that the single-exit property is satisfied. Mapping µ is derived from call transitions that point to module entry states. Choosing an arbitrary transition to update µ is sound because the state naming functions guarantee that all transitions pointing to entry state (ua, ε) are labeled by element name a. In the last step, the dXVPA is minimized by merging congruent modules. Algorithm genXVPA only deals with finite sets and terminates after minimize terminates.

Minimization

Algorithm 7: minimize a dXVPA by merging equivalent modules
Input: dXVPA A = ({Q_m, e_m, X_m, δ_m}_{m∈M}, m_0, X_{m_0}) over (Σ, M, µ, T, φ)
Output: minimized dXVPA A
1  while ∃m ∃n. m, n ∈ M and m ≠ n and µ(m) = µ(n) and DFA_m ≃ DFA_n do
2      let ϕ : Q_n → Q_m   // bisimulation of congruent DFAs
3      for q_n −c/p_i→ q_i ∈ δ_n^ret do
4          δ_i^call := δ_i^call \ { p_i −c/p_i→ e_n } ∪ { p_i −c/p_i→ e_m }
5          δ_m^ret := δ_m^ret ∪ { x_m −c/p_i→ q_i | x_m ∈ X_m }
6      for q_n −c/q_n→ e_i ∈ δ_n^call do
7          δ_i′^ret := ∅
8          for q_i −c/p_j→ q_j ∈ δ_i^ret do
9              if j = n then δ_i′^ret := δ_i′^ret ∪ { q_i −c/ϕ(p_j)→ ϕ(q_j) }
10             else δ_i′^ret := δ_i′^ret ∪ { q_i −c/p_j→ q_j }
11         δ_i^ret := δ_i′^ret
12     if n = m_0 then m_0 := m
13     M := M \ {n}   // remove module n
14     µ(n) := ∅

1 while m n.m,n M and m = n and µ(m) = µ(n) and DFA DFA do ∃ ∃ ∈ ̸ m ≃ n 2 let ϕ : Q Q // bisimulation of congruent DFAs n → m c/pi ret 3 for q q δ do n −−→ i ∈ n call call c/pi c/pi 4 δ := δ p e p e i i \{ i −−→ n} ∪ { i −−→ m} ret ret c/pi 5 δ := δ x q x X m m ∪ { m −−→ i | m ∈ m} c/qn call 6 for qn ei δn do ′ret−−→ ∈ 7 δi = /0 c/p j ret 8 for q q δ do i −−→ j ∈ i ret ret c/ϕ(p j) 9 if j = n then δ ′ := δ ′ q ϕ(q ) i i ∪ { i −−−−→ j } ret ret c/p j 10 else δ ′ := δ ′ q q i i ∪ { i −−→ j} ret ′ret 11 δi := δi 12 if n = m0 then m0 := m 13 M := M n // remove module n \{ } 14 µ(n) := /0

The minimize routine in Algorithm7 compares pairs of modules which are called by the same element name and have congruent DFA representations according to the construction in Theorem1. If two congruent modules m and n exist, module n is folded

110 5.4. LEARNING FROM POSITIVE EXAMPLES

into m by redirecting calls and returns to corresponding states in Qm. Bijective function ϕ captures the relation between equivalent states in m and n according to bisimulation. First, all calls to n and their corresponding returns are transferred to m. Second, the returns of outgoing calls from n are directed to equivalent states in m. Finally, module n is removed. Algorithm minimize terminates because the number of comparisons and all involved sets are finite. However, it should be noted that the routine is not optimal and has polynomial time complexity from repeated comparisons of pairs of modules and bisimulating their DFA representations.

Incremental Learner Algorithm8 puts all the pieces together for an incremental learner. For a given lexical datatype system, three refined state naming functions for a chosen naming scheme, and chosen parameters k and l, the incremental learner computes a dXVPA from document event stream w. If the dXVPA has been constructed from ancestor-based states, a transla- tion to an XSD schema is possible based on Theorem1, and according to Theorem2, the dXVPA can then be translated into a cXVPA for stream validation of future documents. The internal VPA representation A is considered persistent for incremental updates. The need for an intermediate VPA representation can be explained as follows. Suppose a dXVPA is incrementally updated without keeping an intermediate VPA. After learning event stream wi, modules m and n could be congruent and folded by the minimization routine. However, a later event stream w j contains evidence that m and n are not congruent, but they have already been folded. Maintaining the intermediate VPA representation prevents information loss from premature minimization.

Algorithm 8: Incremental learner
Input: document event stream w
       lexical datatype system dts = (T, φ, ∼_s, ≤_s)
       refined naming functions fn = (intName_{k,l}, callName_{k,l}, retName_{k,l})
       parameters k and l
Output: dXVPA A′

1 persistent VPA A, and initially, A = ( (ε,ε) ,(ε,ε), /0, (ε,ε) , /0) { } { } 2 A := incVPA(A,dts, f n,w) 3 A′ := genXVPA(A)

Example 12. The dXVPA illustrated in Figure 5.10 is the result of genXVPA applied to the state-merged VPA in Figure 5.9. For convenience, the start state q_0 and the final state q_f according to dXVPA semantics have been added. The states are partitioned into three modules, and from the chosen parameter l = 2, the learner distinguishes two different content models for the a-tag: while the root-tag expects a nested a-tag and one or more nested b-tags, the nested a-tag expects character data of datatype decimal. Furthermore, module a#b has two exit states, and to satisfy the single-exit property, additional returns have been added by genXVPA, so every exit state behaves the same.

Set-Driven Learner

The computational complexity discussion in the next section reveals that minimize has polynomial worst-case complexity with respect to the length of the document event streams.


Figure 5.10: Generated dXVPA from the VPA in Figure 5.9

For efficient learning from a set of examples S+, the incremental learner can be refined to a set-driven learner as proposed in Algorithm 9, where genXVPA is called only once per set of examples. Again, the internal VPA representation is considered persistent, and the learner incrementally learns from sets of examples, e.g., {w_1}, {w_2, ..., w_i}, {w_{i+1}}, .... When examples are presented as singletons to the set-driven learner, the incremental capabilities are the same as in Algorithm 8.

Algorithm 9: Set-driven learner
Input:  finite set of examples S+, i.e., document event streams;
        lexical datatype system dts = (T, φ, ∼_s, ≤_s);
        refined naming functions fn = (intName_{k,l}, callName_{k,l}, retName_{k,l});
        parameters k and l
Output: dXVPA A′

1   persistent VPA A, initially A = ({(ε,ε)}, (ε,ε), ∅, {(ε,ε)}, ∅)
2   for w ∈ S+ do A := incVPA(A, dts, fn, w)
3   A′ := genXVPA(A)
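In code, both learners reduce to folding incVPA over the presented examples and exporting once at the end; the Scala sketch below assumes hypothetical types and the signatures of incVPA and genXVPA as used in Algorithms 8 and 9, so it illustrates only the control flow.

  // Sketch of Algorithms 8 and 9 around a persistent intermediate VPA.
  // All types and the two functions are hypothetical stand-ins.
  trait Vpa; trait DXvpa; trait DatatypeSystem; trait NamingFunctions
  type EventStream = Seq[String]
  def initialVpa: Vpa = ???
  def incVPA(a: Vpa, dts: DatatypeSystem, fn: NamingFunctions, w: EventStream): Vpa = ???
  def genXVPA(a: Vpa): DXvpa = ???

  class Learner(dts: DatatypeSystem, fn: NamingFunctions) {
    private var vpa: Vpa = initialVpa                  // persistent intermediate VPA

    // Incremental learner (Algorithm 8): one example, one export.
    def learn(w: EventStream): DXvpa = { vpa = incVPA(vpa, dts, fn, w); genXVPA(vpa) }

    // Set-driven learner (Algorithm 9): export only once per example set.
    def learnAll(examples: Set[EventStream]): DXvpa = {
      examples.foreach(w => vpa = incVPA(vpa, dts, fn, w))
      genXVPA(vpa)
    }
  }

Presenting singleton sets to learnAll reproduces the behavior of learn, matching the remark above.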

5.4.6 Parameter Semantics

The parameters k and l characterize the hypothesis space of the learner. The lower bound is k = l = 1, and under this setting the resulting state-merged VPAs are congruent for both the ancestor- and the ancestor-sibling-based naming scheme. Only the nearest ancestor and left sibling then guide the typing mechanism. Expressiveness of the learner is restricted to a subclass of DTDs because typing contexts are named after elements, and production rules are assumed to be SOREs because k = 1 implies a 2-testable regular language within a typing context. The expressiveness of the learner can be increased by choosing the ancestor-sibling-based naming scheme or by increasing the parameters. Furthermore, for k = 1 and l > 1, the ancestor- and ancestor-sibling-based naming schemes also produce congruent automata. The upper bound for parameter l is the expected nesting depth.
Let L^{as}_{k,l} be the learnable language class for fixed parameters k and l under ancestor-based states. Then L^{as}_{k',l'} ⊇ L^{as}_{k,l} is a more fine-grained language class for increased

parameters k' ≥ k and l' ≥ l. The same relation holds for ancestor-sibling-based states, i.e., L^{als}_{k',l'} ⊇ L^{als}_{k,l} for k' ≥ k and l' ≥ l. For fixed parameter k = 1, L^{as}_{1,l} = L^{als}_{1,l}. Ancestor-based states restrain the learnable language class to single-type EDTDs, i.e., L^{as}_{k,l} ⊊ EDTD^{st}, and ancestor-sibling-based states restrain the learnable class to EDTDs having the 1PPT property, i.e., L^{als}_{k,l} ⊊ EDTD^{rc}. For increased k and l, more examples are necessary to reach a stable representation.
The expressiveness of the target language is not known to the learner in the proposed anomaly detection setting. An inferred automaton is therefore an approximation of the true language, and the quality of this approximation can only be evaluated experimentally. Chapter 6 presents several experiments and discusses choices of naming scheme and parameters with respect to the trade-off between fast stabilization and quality of the approximation.

5.5 Learner Properties

Theorem 5. The proposed set-driven learner in Algorithm 9 identifies the subclass of mixed-content 1-pass preorder locally typed XML, i.e., dEDTD^{as}_{k,l} ⊊ dEDTD^{als}_{k,l} ⊊ dEDTD^{rc}, from positive examples and returns a dXVPA without dead states. The learner is (1) incremental, (2) set-driven, (3) consistent, (4) conservative, and (5) strong-monotonic.

Proof. (No dead states) The property follows from the naming of states for prefixes of well-matched event stream examples according to the refined state naming functions in Equations 5.20–5.23. Therefore, by construction in Algorithm 4, all generated states lie on paths from the start state to accepting states.
(Incremental learning) In Gold's identification in the limit from positive examples [155] (Definition 8), a learner receives examples E(1), E(2), ..., and for every example, a new hypothesis D_i is computed. This property follows directly from algorithms incVPA and genXVPA: an incremental learner (Algorithm 8) receiving document event streams w_1, w_2, ... computes dXVPAs A′_1, A′_2, ... respectively.
(Set-driven learning) In general, a set-driven learner is insensitive to the order of presented examples [198]. This property is satisfied by the proposed learner because states and transitions are treated as sets in Algorithm 4. The sets are only updated if new evidence has been observed from a prefix of an example's event stream and remain unchanged otherwise. Therefore, the dXVPAs A′_{S+} and A′_{R+} are the same if S+ = R+, and order independence follows from the set-driven property.
(Consistent, conservative, and strong-monotonic learning) A learner is consistent if all examples presented so far to the learner are accepted by the respective hypothesis [14], i.e., ∀j. i < j ⟹ E(i) ∈ L(D_j). Angluin [14] furthermore characterizes a learner as conservative if a hypothesis is kept as long as no contradicting evidence has been observed, i.e., ∀j. i < j and D_i ≠ D_j ⟹ E(j) ∉ L(D_i). Jantke [199] defines strong monotonicity in inductive inference as the property that the language coverage of the returned device increases with every example, i.e., for examples E(1), E(2), ... a learner returns D_1, D_2, ... with L(D_i) ⊆ L(D_j) whenever i < j.
In the proposed learner, these properties follow from updating the sets of states and transitions in the intermediate VPA in Algorithm 4 using the refined state naming functions for implicit state merging. States and call and return transitions are never deleted, and new ones are only added if new evidence is presented. Also, an internal transition on datatype τ is only removed if a new transition on τ′ is added such that τ′ is more


general and covers the language of τ, i.e., τ ≤_lex τ′.

5.5.1 Computational Complexity

The variables with respect to time and space complexity in the proposed learner are the number of element and attribute names |Σ|, the (worst-case) length |w| of a document event stream w (denoted as n), the worst-case length |lab(e)| of character data in an event stream (denoted as m), the number of examples |S+|, and the parameters k and l.

Update of an Intermediate VPA

Algorithm 4 (incVPA) iterates over a datatyped event stream and creates a state and one or more transitions for every prefix. The number of characters events in an event stream is ⌊n/2⌋ in the worst case. While startElement and endElement events add a single state and transition in effectively constant time, a characters event adds one state and multiple internal transitions. Algorithms pref and minLex are called every time for computing the minimally required datatypes to update existing datatype transitions using minReqInc.
Algorithm 1 (minLex) is a bottom-up search in topological order in the poset (T, ≤_lex), and regular language membership for lab(e) is checked |T| times for a single characters event e, i.e., O(|T| · m). The optimized variant minLex_tf is linear in m, i.e., O(m), but the precomputed DFA_tf from Algorithm 2 has worst-case size O(2^{j·|T|}), where j is the length of the longest regular expression in the lexical space definitions φ. The exponential blowup is a well-known result in formal language theory for translating a regular expression into a DFA [185]. The |T| DFAs are unified in the product automaton construction.
Algorithm 3 (pref) makes O(|T|^2) pairwise comparisons in the worst case. Function minReqInc calls max_{≤lex} to compute a set of minimally required datatypes, where the naive algorithm has time complexity O(|T|^2) from pairwise comparisons of datatypes with respect to their order.
The worst-case time complexity for learning an event stream w in incVPA without optimizations is therefore O(⌈n/2⌉ + ⌊n/2⌋ · m · |T|^4). The complexity is multiplied by |S+| for the set-based learner. For a constant datatype system, the time complexity is therefore effectively linear, i.e., O(|S+| · (⌈n/2⌉ + ⌊n/2⌋ · m)). The datatype-system-dependent time complexity can be reduced by minLex_tf and an improved topological search in pref and max_{≤lex} toward O(|T| + |≤_lex|).
With respect to the worst-case space complexity of the intermediate VPA, every prefix of an event stream creates a state and one or more transitions. While calls and returns create at most a single transition, character data creates |T| transitions in the worst case. Parameters k and l in the state naming functions entail a strong combinatorial upper bound on the number of possible states; in particular, the combinatorial worst case for ancestor-based states is O(|Σ|^l), and for ancestor-sibling-based states it is O(|Σ|^{l·k}). The worst-case space complexity of an intermediate VPA for a single example is the sum of the number of

states, the number of call and return transitions, and the number of internal transitions, i.e., O(n + ⌈n/2⌉ + ⌊n/2⌋ · |T|). For the set-based learner or after learning multiple examples, this complexity is multiplied by |S+|, and for a constant datatype system, the worst-case space complexity becomes effectively linear, i.e., O(|S+| · n).

Export of a dXVPA

Algorithm 6 (genXVPA) generates modules, updates return transitions, generates the mapping µ, and minimizes the automaton by merging congruent modules. Modules are state partitions of the incrementally updated VPA which has been inferred from |S+| examples. In the worst case, every state is in a unique module, and the number of states and the number of modules are therefore equal, i.e., O(|S+| · n), denoted as i in the following. Set operations are assumed to be effectively constant, guaranteeing the single-exit property can be made implicit by choosing a proper data structure for dXVPA modules, and minimization needs a more detailed discussion.
Algorithm 7 (minimize) is called by genXVPA to reduce the size of a dXVPA. The algorithm compares pairwise distinct modules and restarts after a merge. In the worst case, there are i^2 + (i-1)^2 + (i-2)^2 + ... + 1 candidates for comparisons, which is a square pyramidal number [13], i.e., (1/6) i(i+1)(2i+1) = (2i^3 + 3i^2 + i)/6. This function does not grow faster than O(i^3). For a single comparison, two DFAs according to Theorem 1 are derived from return transitions in all modules, and two DFAs are congruent if there is a bisimulation ϕ. In the worst case, the two DFAs have i/2 states each, and computing the naive bisimulation needs O(i^2) steps for pairwise comparisons. Thus, the comparisons and bisimulations dominate the nested iterations in minimize, the time complexity does not grow faster than O(i^5) = O((n · |S+|)^5), and the overall time complexity of genXVPA becomes O((n · |S+|) + (n · |S+|)^5) in the worst case. The average time complexity should be lower, and minimize can immediately be improved to O(i^4 log i) by selecting a better algorithm for bisimulation [62].
With respect to worst-case space complexity, the size of a dXVPA is the sum of the number of states and transitions after learning |S+| examples. If minimization is not possible, the worst-case size of a dXVPA remains the size of the intermediate VPA, i.e., O(|S+| · (n + ⌈n/2⌉ + ⌊n/2⌋ · |T|)), which is for a constant datatype system effectively linear in the size of the example set and the length of examples, i.e., O(|S+| · n).

5.6 Anomaly Detection Refinements

For anomaly detection, some of the learner properties need to be deliberately weakened. For example, if an attacker is able to poison the training data once, the consistency and strong-monotonicity properties ensure that the output of the learner is forever poisoned. Several anomaly detection refinements are therefore discussed.

5.6.1 Event Stream Security Heuristics

A first refinement concerns the representation of event streams. The lengths of namespaces, elements, and attributes are unbounded in XML [451], and as discussed in Chapter 3, an oversized token is already a trivial parsing attack. Defining a fixed upper bound for element and attribute lengths in event streams could mitigate the problem.


Another trivial attack is coercive parsing, i.e., continuously nesting tags to exhaust data structures in the parser. In particular, two cases can be distinguished: vertical and horizontal exhaustive iteration. The first case concerns recursive nesting of XML tags, and in a cXVPA, the affected structure during stream validation is the stack. A trivial countermeasure is an upper bound on the stack height. Repetition of well-matched XML is the second case, which could lead to very large documents that are still valid. The cXVPA representation is not affected because the transitions inferred by the proposed learner are always deterministic, and horizontal repetition keeps the stack height within a certain range. However, the designated service could eventually fail in handling the size of the document. A potential countermeasure is inferring counting restrictions and enforcing them in cXVPA runs; however, this is still an open problem in the proposed learner.
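A minimal sketch of such heuristics on top of the Java-native StAX API already used by the prototype is given below; the concrete limits maxDepth and maxTokenLength are illustrative assumptions, not values prescribed by the thesis.

  import java.io.StringReader
  import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

  // Sketch: reject oversized name tokens and excessive nesting while streaming.
  def withinLimits(xml: String, maxDepth: Int = 64, maxTokenLength: Int = 1024): Boolean = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    var depth = 0
    var ok = true
    try {
      while (ok && reader.hasNext) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            depth += 1
            if (depth > maxDepth) ok = false                                  // coercive parsing
            else if (reader.getLocalName.length > maxTokenLength) ok = false  // oversized token
          case XMLStreamConstants.END_ELEMENT => depth -= 1
          case _                              => ()
        }
      }
      ok
    } finally reader.close()
  }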

5.6.2 Filtering

If an attacker can successfully hide an attack in training data, the learned cXVPA will not be able to identify the attack, and possibly others, during stream validation. Poisoning attacks are generally a threat to learning-based security mechanisms, and a straightforward approach is to remove well-known attacks or inconsistent examples from training data. Examples for filtering are:

• schema validation of document parts, where a schema is available (a sketch follows below)
• policy-based checking for XML signatures and ID-based references [53]
• pattern matching for attack or virus signatures in character data
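A sketch of the first option, using the standard javax.xml.validation API; the assumption here is that an XSD file for the document (or a document part) is available, which is not always the case in the scenarios considered in this thesis.

  import java.io.{File, StringReader}
  import javax.xml.XMLConstants
  import javax.xml.transform.stream.StreamSource
  import javax.xml.validation.SchemaFactory
  import org.xml.sax.SAXException

  // Sketch: filter a training example by validating it against a known schema.
  def passesSchemaFilter(xml: String, xsdFile: File): Boolean = {
    val schema    = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI).newSchema(xsdFile)
    val validator = schema.newValidator()
    try { validator.validate(new StreamSource(new StringReader(xml))); true }
    catch { case _: SAXException => false }   // violating example: exclude from training data
  }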

5.6.3 Unlearning and Evolution

Learning in an anomaly detection scenario is a continuous process, and a once-learned example w_i may be identified as violating at a later time, in which case w_i should be unlearned. The proposed incremental and set-based learner cannot, however, deal with such a situation because implicitly merged states and transitions reside in sets and are not quantified.
To introduce an unlearning capability, weight counters are proposed for VPA states and transitions. This approach is motivated by previous work of the author [228, 229], and a counter reflects how often a certain state or transition has been learned from examples so far. An example w_i can be unlearned by traversing the states and transitions in a run on w_i, decrementing the respective counters, and deleting states and transitions with zero weight.
The first step is a refinement of Algorithm 4 (incVPA) by introducing weight counters. Algorithm 10 (incWeightedVPA) returns a VPA A and associated weight counters ω_Q, ω_F, and ω_δ. The major difference to incVPA is that minReqInc is not computed for a characters event. Algorithm incWeightedVPA terminates for finite-length document event streams, and counter increments are assumed to be effectively constant. The time complexity for learning a set of examples S+ remains O(|S+| · (⌈n/2⌉ + ⌊n/2⌋ · m)) for a constant datatype system. For the start state q_0, the weight is defined as ω_Q(q_0) = ∞, so it can never be removed. The returned VPA A from incWeightedVPA still captures every state and transition learned from all examples so far.


Algorithm 10: incWeightedVPA for incrementally updating an intermediate VPA
Input:  intermediate VPA A = (Q, q_0, F, Q, δ) over Σ ⊎ T ⊎ Σ;
        lexical datatype system (T, φ, ∼_s, ≤_s);
        functions callName, intName, and retName for state naming;
        weight counters ω_Q, ω_F, and ω_δ;
        well-matched document event stream w
Output: updated VPA A, updated weight counters ω_Q, ω_F, and ω_δ

1   s := ⊥                                  // empty stack
2   q := q_0                                // current state
3   for e in datatyped(w) do
4       switch eventType(e) do
5           case startElement
6               Σ := Σ ∪ {lab(e)}           // update element names
7               q′ := callName(q, lab(e))
8               ω_Q(q′) := ω_Q(q′) + 1
9               δ^call := δ^call ∪ {q --lab(e)/q--> q′}
10              ω_δ(q --lab(e)/q--> q′) := ω_δ(q --lab(e)/q--> q′) + 1
11              s := s · q
12              q := q′
13          case endElement
14              let s = v · p               // p is top of the stack
15              q′ := retName(q, p, lab(e))
16              ω_Q(q′) := ω_Q(q′) + 1
17              δ^ret := δ^ret ∪ {q --lab(e)/p--> q′}
18              ω_δ(q --lab(e)/p--> q′) := ω_δ(q --lab(e)/p--> q′) + 1
19              s := v
20              q := q′
21          case characters, where lab(e) are the minimally required datatypes
22              q′ := intName(q)
23              ω_Q(q′) := ω_Q(q′) + 1
24              δ^int := δ^int ∪ {q --τ--> q′ | τ ∈ lab(e)}
25              for τ ∈ lab(e) do ω_δ(q --τ--> q′) := ω_δ(q --τ--> q′) + 1
26              q := q′
27  F := F ∪ {q}
28  ω_F(q) := ω_F(q) + 1
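The weight counters ω_Q, ω_F, and ω_δ can be kept in simple maps with a default of zero; the following Scala sketch is one possible representation (the key types are placeholders) and is reused by the sanitization sketch later in this section.

  import scala.collection.mutable

  // Sketch of weight counters for VPA states and transitions, as used by
  // incWeightedVPA, unlearn, and trim. K is an opaque key type (state or transition).
  class Weights[K] {
    private val w = mutable.Map.empty[K, Long].withDefaultValue(0L)
    def inc(k: K): Unit          = w(k) = w(k) + 1
    def dec(k: K): Unit          = w(k) = w(k) - 1
    def set(k: K, v: Long): Unit = w(k) = v
    def apply(k: K): Long        = w(k)
    // Keys whose weight dropped to zero; trim removes the corresponding
    // states and transitions from the intermediate VPA.
    def zeroed: Set[K] = w.collect { case (k, v) if v <= 0 => k }.toSet
  }
  // The start state q0 can be pinned, e.g., by setting its weight to Long.MaxValue,
  // mirroring ω_Q(q0) = ∞ in the text.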


Algorithm 11: trim removes zero-weight states and transitions from a VPA
Input:  intermediate VPA A = (Q, q_0, F, Q, δ) over Σ ⊎ T ⊎ Σ;
        lexical datatype system (T, φ, ∼_s, ≤_s);
        weight counters ω_Q, ω_F, and ω_δ
Output: VPA A′ = (Q′, q_0, F′, Q′, δ′)

1   δ′^call := δ^call \ {q --c/q--> q′ | ω_δ(q --c/q--> q′) = 0}
2   δ′^ret  := δ^ret  \ {q --c/p--> q′ | ω_δ(q --c/p--> q′) = 0}
3   δ′^int  := δ^int  \ {q --τ--> q′ | ω_δ(q --τ--> q′) = 0}
4   δ′′^int := ∅
5   foreach (q, q′) ∈ {(q, q′) | ∃τ. q --τ--> q′ ∈ δ′^int} do
6       let R = {τ | q --τ--> q′ ∈ δ′^int}
7       δ′′^int := δ′′^int ∪ {q --τ--> q′ | τ ∈ max_{≤lex} R}
8   δ′ := δ′^call ⊎ δ′^ret ⊎ δ′′^int
9   Q′ := Q \ {q | ω_Q(q) = 0}
10  F′ := F \ {q | ω_F(q) = 0}

Algorithm 11 (trim) maps A and its counters to a new VPA A′ in which all states and transitions with zero weight have been removed. For computing correct datatype choices, the trim algorithm identifies pairs of states connected by internal transitions and restricts the datatype choice to the respective maximal datatypes, in the spirit of Equation 5.3. The trim algorithm terminates because all involved sets are finite. With respect to time complexity, for a constant datatype system and effectively constant set operations, the number of state pairs connected by internal transitions that the loop iterates over is linear, i.e., O(|S+| · n), and this growth rate also reflects the overall worst-case time complexity. Furthermore, algorithms incWeightedVPA and trim integrate seamlessly for dXVPA export by

genXVPA′(A, dts, ω_Q, ω_F, ω_δ) = genXVPA(trim(A, dts, ω_Q, ω_F, ω_δ)).    (5.24)

Algorithm 12 (unlearn) illustrates how counters are decremented for some document event stream w_i that has already been learned at an earlier time. The algorithm simulates a run on w_i and traverses exactly the states and transitions in the order they have been added by incWeightedVPA, and the weight counters are decremented accordingly. In case of a characters event e, one datatype is picked from the minimally required datatypes to locate the next state, and the internal transitions between the current and the next state are decremented for all minimally required datatypes. Algorithm 12 terminates for a finite-length event stream w_i, and its time complexity is linear, i.e., O(⌈n/2⌉ + ⌊n/2⌋ · m), for a constant datatype system and constant counter operations.

Corollary. If a VPA is learned from event streams w_1, ..., w_i, ..., w_j and w_i is unlearned, the resulting VPA is consistent with w_1, ..., w_{i-1}, w_{i+1}, ..., w_j.

An unlearning capability also enables evolution of a dXVPA over time by unlearning outdated examples, e.g., by keeping track of examples and annotating them with timestamps, to address concept drift of the interface language.


Algorithm 12: unlearn a once learned document event stream
Input:  intermediate VPA A = (Q, q_0, F, Q, δ) over Σ ⊎ T ⊎ Σ;
        lexical datatype system (T, φ, ∼_s, ≤_s);
        weight counters ω_Q, ω_F, and ω_δ;
        well-matched document event stream w
Output: updated weight counters ω_Q, ω_F, and ω_δ

1   s := ⊥                                  // empty stack
2   q := q_0                                // current state
3   for e in datatyped(w) do
4       switch eventType(e) do
5           case startElement
6               q′ := δ^call(q, lab(e))
7               ω_Q(q′) := ω_Q(q′) - 1
8               ω_δ(q --lab(e)/q--> q′) := ω_δ(q --lab(e)/q--> q′) - 1
9               s := s · q
10              q := q′
11          case endElement
12              let s = v · p               // p is top of the stack
13              q′ := δ^ret(q, lab(e), p)
14              ω_Q(q′) := ω_Q(q′) - 1
15              ω_δ(q --lab(e)/p--> q′) := ω_δ(q --lab(e)/p--> q′) - 1
16              s := v
17              q := q′
18          case characters, where lab(e) are the minimally required datatypes
19              q′ := δ^int(q, τ) for some τ ∈ lab(e)
20              ω_Q(q′) := ω_Q(q′) - 1
21              for τ ∈ lab(e) do ω_δ(q --τ--> q′) := ω_δ(q --τ--> q′) - 1
22              q := q′
23  ω_F(q) := ω_F(q) - 1


5.6.4 Sanitization

Filtering and unlearning are useful for preventing already known attacks and for reacting to identified attacks, respectively. Unknown poisoning attacks in training data should also be addressed, and Cretu et al. [101] propose a sanitization process under the assumption that unknown poisoning attacks are rare. In the proposed learner with counters, the counter values reflect frequencies in the training data; in particular, the weight of a state equals both the sum of the weights of its incoming transitions and the sum of the weights of its outgoing transitions. Sanitization can be performed by decrementing the counter values of transitions by one and recomputing the weights of states. Low-frequency states and transitions from training data are then removed by trim at a later stage. Decrementing weight counters is only sound under the assumptions that accepted documents are highly frequent and tend to have similar structures and datatypes, and that unknown attacks are rare and have significantly different structure or datatypes.
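The first sanitization stage can be sketched on top of the Weights class introduced after Algorithm 10; the Transition type is a hypothetical key exposing its target state, and the full algorithm additionally checks reachability before committing the new counters.

  // Sketch of sanitization stage 1: decrement every transition weight once and
  // recompute state weights as the sum of their incoming transition weights.
  case class Transition(from: String, label: String, to: String)

  def sanitizeCounters(transitions: Set[Transition],
                       wDelta: Weights[Transition],
                       wState: Weights[String]): Unit = {
    transitions.foreach(wDelta.dec)
    transitions.groupBy(_.to).foreach { case (q, incoming) =>
      wState.set(q, incoming.toSeq.map(t => wDelta(t)).sum)   // recompute ω'_Q(q)
    }
  }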

Algorithm 13: sanitize removes low-frequency states and transitions
Input:  intermediate VPA A = (Q, q_0, F, Q, δ) over Σ ⊎ T ⊎ Σ;
        weight counters ω_Q, ω_F, and ω_δ
Output: sanitized weight counters ω′_Q, ω′_F, and ω′_δ

1   for every defined transition x do ω′_δ(x) := ω_δ(x) - 1
2   for q ∈ Q do
3       ω′_Q(q) := Σ_{transition x to q} ω′_δ(x)
4       ω′_F(q) := ω′_Q(q)
5   A′ := trim(A, dts, ω′_Q, ω′_F, ω′_δ)
6   let Q_u be the unreachable states in A′
7   if (F′ \ Q_u) = ∅ then                  // revert all changes
8       ω′_Q := ω_Q
9       ω′_F := ω_F
10      ω′_δ := ω_δ
11  else                                     // eliminate weights of unreachable states
12      for q ∈ Q_u do
13          for every transition x to q do ω′_δ(x) := 0
14          ω′_Q(q) := 0
15          ω′_F(q) := 0

Algorithm 13 (sanitize) has two stages. The first stage decrements the weight counters of transitions and recomputes the weights of states. In the second stage, an intermediate trimmed VPA is generated from the updated weight counters, and unreachable states are identified. If no final state is reachable in the intermediate trimmed VPA, all weight counters are restored because sanitization is not applicable. Otherwise, the weights of unreachable states and of the transitions to them are set to zero so that a later trim procedure eliminates the unreachable states. The algorithm terminates because the operations in sanitize are on finite sets, and reachability is decidable in VPAs [10]. Sanitization affects the properties of the learner: a sanitized hypothesis is not consistent with previously learned examples, modifying the hypothesis without evidence is not

120 5.6. ANOMALY DETECTION REFINEMENTS conservative, and the accepted languages before and after sanitization could be signifi- cantly different and violate monotonicity. Furthermore, once sanitize has been applied, unlearn is not sound anymore because counter semantics have changed. Algorithm 13 should only be applied after learning a sufficiently large number of examples to reduce its side effects.


Chapter 6

Experimental Evaluation

For a proof of concept, this chapter presents the experimental evaluation in an anomaly detection setting. Section 6.1 summarizes how the learner proposed in Chapter 5 has been implemented. Experimental results need to be quantified, and Section 6.2 introduces measures for detection performance, estimation of learning progress, and operational performance. The evaluated scenarios and datasets are presented in Section 6.3, and the chapter closes with the measurements from the experiments in Section 6.4.

6.1 Implementation

The learner and validator components have been implemented in the Scala 2.11.7 [314] programming language and executed on the Java Runtime Environment version 8u51. The dk.brics.automaton [277] library enables Unicode regular expressions and automata operations in the prototype for cXVPA predicates and the lexical datatype system. Efficient stream processing of XML documents is done by instrumenting the Java-native javax.xml.stream StAX processor.
The prototype covers: the document event stream representation; models for VPAs, dXVPAs, and cXVPAs; the lexical datatype system for algorithms minLex, pref, and minReqInc, including the computation of ≤_lex and ≤_s from XSD lexical space definitions; algorithms incVPA, incWeightedVPA, genXVPA, and minimize using the refined state naming functions for the incremental and set-driven learner; algorithms unlearn and sanitize for a learner refined by weight counters; and all necessary performance evaluation procedures. The prototype lacks linear-time lexical subsumption by a precomputed DFA_tf because the naive variant already had sufficient operational performance for a proof of concept. The event stream security heuristics proposed in the previous chapter have also been skipped.
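As an example of how the prototype's building blocks fit together, the following sketch uses dk.brics.automaton for lexical-space membership tests; the regular expressions are simplified stand-ins for the XSD lexical space definitions in φ, not the exact patterns used in the implementation.

  import dk.brics.automaton.{RegExp, RunAutomaton}

  // Sketch: precompiled lexical spaces for a few datatypes and a membership test
  // that yields the candidate datatypes for a text node (minLex then selects the
  // minimal ones with respect to ≤lex).
  val lexicalSpaces: Map[String, RunAutomaton] = Map(
    "integer" -> "(\\+|-)?[0-9]+",
    "decimal" -> "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)",
    "boolean" -> "(true|false|1|0)"
  ).map { case (name, pattern) => name -> new RunAutomaton(new RegExp(pattern).toAutomaton) }

  def candidateDatatypes(text: String): Set[String] =
    lexicalSpaces.collect { case (name, ra) if ra.run(text) => name }.toSet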

6.2 Measures

Techniques for intrusion detection are usually heuristics that need experimental evaluation to discuss performance; however, sound evaluation is hard [150, 392]. For the proposed anomaly detection technique, three aspects of performance are considered:

• detection performance
• learning progress
• operational performance


6.2.1 Detection Performance

An experiment for estimating performance requires a representative dataset of positive and negative examples for a particular evaluation scenario, i.e., normal and attack-carrying XML documents, and the ground truth. A dataset with its ground truth is commonly referred to as a labeled dataset, and a binary confusion matrix as shown in Table 6.1 is constructed by counting the respective classification results.

Table 6.1: The binary confusion matrix for classification

                                  Truth
                         Attack                 Normal
Prediction   Attack      True positive (TP)     False positive (FP)
             Normal      False negative (FN)    True negative (TN)

The binary confusion matrix distinguishes the classes Normal and Attack for acceptable and violating XML documents respectively. If the validator component correctly identifies attack-carrying and normal examples, the true positive (TP) and true negative (TN) counters are incremented respectively. Errors increment the false positive (FP) counter, i.e., false alarms, and the false negative (FN) counter. For a dataset S, the size equals the grand sum of the confusion matrix, i.e., |S| = TP + TN + FP + FN, and various detection performance measures can be derived.

FPR = FP / (FP + TN)    (6.1)

TPR = Re = TP / (TP + FN)    (6.2)

Pr = TP / (TP + FP)    (6.3)

F_1 = 2 · (Pr · Re) / (Pr + Re)    (6.4)

Equations 6.1–6.4 summarize the most popular measures. The false-positive rate (FPR) in Equation 6.1, also referred to as false-alarm rate, is the ratio of false-positive classifications to actual normal instances. Equation 6.2 computes the detection rate of a classifier, typically referred to as true-positive rate (TPR) or recall (Re). A perfect IDS would deliver TPR = 100% and FPR = 0% in all possible experiments. As a remark, if a classifier has a variable threshold for decision making, a Receiver Operating Characteristic (ROC) curve can be drawn to visualize the relationship between TPR and FPR under changing thresholds, and the area under the curve (AUC) is another popular measure that condenses a classifier's performance under a variable decision threshold into a single value.

False Positive Paradox. While TPR and FPR are important measures for applications where classes are equally distributed, the expressiveness of FPR can suffer in the intrusion detection setting. The distribution of attacks and normal instances is generally not known beforehand; attacks could be very infrequent, and the distribution could be heavily skewed. According to Axelsson [34], this problem leads to a false positive paradox, where an intrusion detection system generates an unacceptable amount of false alarms even when the FPR seems sufficiently low. For network-packet-based intrusion detection, Axelsson furthermore argues that FPR < 10^{-5} is needed to be of practical use. In this thesis, FPR is only used for illustrating the convergence of false positives to zero during learning.


Recall and Precision. For binary classification where the distribution of classes is skewed or unknown, Davis and Goadrich [107] recommend recall and precision (Pr) as performance measures. Equation 6.3 defines Pr, and its value is independent of TN. Intuitively, Pr reflects a classifier's reliability for positive classifications, e.g., class Attack. The F1-measure (F_1) in Equation 6.4 is the harmonic mean of Re and Pr and captures the overall performance in a single value. A perfect IDS would deliver Re = 100%, Pr = 100%, and F_1 = 100% in all possible experiments, and maximizing precision and F_1 implicitly minimizes the FPR. As a remark, for a classifier with a variable decision threshold, the precision-recall curve captures the relationship between precision and recall, and its area under the curve (PR-AUC) is another measure that reflects the overall detection performance under a variable decision threshold. In this thesis, Re, Pr, and F_1 are the performance measures of choice.
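The measures of Equations 6.1–6.4 reduce to a few lines of code; the small example in the comment reproduces the VulnShopOrder baseline row of Table 6.3.

  // Sketch: detection performance measures from a binary confusion matrix.
  case class Confusion(tp: Double, fp: Double, tn: Double, fn: Double) {
    def fpr: Double       = fp / (fp + tn)                     // Equation 6.1
    def recall: Double    = tp / (tp + fn)                     // Equation 6.2 (TPR / Re)
    def precision: Double = tp / (tp + fp)                     // Equation 6.3 (undefined if tp + fp = 0)
    def f1: Double        = 2 * precision * recall / (precision + recall)  // Equation 6.4
  }
  // Example: Confusion(tp = 14, fp = 0, tn = 2000, fn = 14) corresponds to the
  // VulnShopOrder baseline (Re = 50%, Pr = 100%, F1 ≈ 66.67%) in Table 6.3.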

6.2.2 Learning Progress

The identification-in-the-limit-from-positive-examples setting has a convergence point N(E) for a particular enumeration of examples, and the models incrementally inferred by the learner do not change after reaching convergence. However, the proposed learner for anomaly detection in XML is a heuristic constrained to the language class dEDTD^{as}_{k,l} or dEDTD^{als}_{k,l}, and the actual language class and target schema are not known to the learner. To estimate the convergence point, the mind changes after learning an example are counted.

Definition 24 (Mind changes [178]). Let VPA_i = (Q_i, q_0, F_i, Q_i, δ_i) be the intermediate VPA in Algorithm 4 learned from example documents w_1, ..., w_i, and let VPA_{i+1} = (Q_{i+1}, q_0, F_{i+1}, Q_{i+1}, δ_{i+1}) be the incrementally updated automaton after learning example w_{i+1}. The number of mind changes after learning example w_{i+1} is MC_{i+1} = (|Q_{i+1}| - |Q_i|) + (|δ_{i+1}| - |δ_i|). As a heuristic, convergence of the proposed learner is assumed when there are zero mind changes over the last j learned examples, i.e., ∑_{i < h ≤ i+j} MC_h = 0.

In the intermediate VPA with weights in Algorithm 10, a mind change is counted if the weight of a state or transition changes from zero to one. The proposed learner, parameterized by fixed k and l, has a strong combinatorial upper bound on the number of named states and transitions for a finite number of elements. Therefore, in the worst case, convergence is reached when all possible states and transitions have been learned. Another potential measure of the learning progress is the distance to the hidden target, e.g., a schema. However, schemas might not be available or extension points could falsify the target language, and this direction has therefore not been further investigated.
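Counting mind changes only requires comparing the sizes of the state and transition sets of successive intermediate VPAs, as the following sketch with a simplified snapshot type shows; the convergence window j is a free parameter of the heuristic.

  // Sketch of Definition 24: mind changes between successive intermediate VPAs.
  case class VpaSnapshot(states: Set[String], transitions: Set[String])

  def mindChanges(prev: VpaSnapshot, next: VpaSnapshot): Int =
    (next.states.size - prev.states.size) + (next.transitions.size - prev.transitions.size)

  // Convergence heuristic: no mind changes over the last j learned examples.
  def converged(mcHistory: Seq[Int], j: Int): Boolean =
    mcHistory.size >= j && mcHistory.takeRight(j).sum == 0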

6.2.3 Operational Performance

The third aspect of performance in the proposed security monitor is operational performance, which concerns the processing speed of the proposed learner and validator components. In particular, throughput during learning and validation is of interest because slow throughput in the security monitor could delay messages and eventually violate SLAs. Processing is typically document based, but computational complexities are given with respect to the length of a document event stream. Throughput for both learning and validation is therefore measured in MBit/s.
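Throughput in MBit/s follows directly from the document size and the wall-clock time of a learning or validation run, as in this small helper; the by-name process parameter stands for either operation.

  // Sketch: throughput of a processing run in MBit/s (10^6 bits per second).
  def throughputMBitPerSec(documentBytes: Long)(process: => Unit): Double = {
    val start = System.nanoTime()
    process
    val seconds = (System.nanoTime() - start) / 1e9
    (documentBytes * 8 / 1e6) / seconds
  }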


6.3 Evaluation Scenarios

For the evaluation of the proposed algorithms, four datasets have been generated to model different scenarios. Every dataset has two sets of XML documents: a training set of acceptable examples and a testing set of acceptable and violating documents. A learner infers a dXVPA from the normal data in the training set, and performance is measured by validating the examples in the testing set using the corresponding cXVPA in a binary classification setting. Table 6.2 describes the properties of the four datasets. While the datasets Carsale and Catalog have been synthetically generated, the datasets VulnShopOrder and VulnShopAuthOrder are actual message recordings from a SOAP/WS-* web service simulation. The following types of attacks have been placed in the testing data:

• XML tampering. The structure of the document is deliberately modified, e.g., by deleting elements or changing their order, for gathering information about XML processing capabilities in the message receiver.

• High node count. For a Denial-of-Service attack against DOM-based processing, a large number of elements or attributes are placed in the document, preferably at an extension point.

• Coercive parsing. Also for Denial-of-Service in the XML processing model, a large number of nested elements are placed in the document, preferably at an extension point.

• Script injection. Under the assumption that a part of the document is processed by a web browser at a later stage, JavaScript code is placed in the document, and CDATA fields are exploited for embedding angled brackets as character data.

• Command injection. Under the assumption that some element content or attribute value in the document is concatenated into a Unix shell command at a later stage, an additional shell command is embedded by exploiting CDATA fields for encoding ampersand symbols.

• SQL injection. Under the assumption that element or attribute values in the document are concatenated in an SQL statement at a later stage, tests for SQL injection are added, e.g., an apostrophe plus fragments of an SQL statement.

• SSRF. One element in the document receives a special xsi:schemaLocation attribute that refers to an external resource, which could eventually trick a vulnerable XML processor into requesting the resource.

• XML injection. This class of attacks includes placement of additional elements in the document to affect processing at a later stage, e.g., repeated elements for confusing an XPath query that matches the first occurrence of a particular element or placing a script tag for JavaScript injection without CDATA fields.

• Signature wrapping. A signed SOAP message is reorganized such that the signature's timestamp and the message body can be arbitrarily modified while the signature verification component in the receiver still passes the message. Wrapper elements for original structures are preferably placed at extension points.


Table 6.2: Properties of the evaluation datasets

                       Training   Testing
Dataset                  Normal    Normal   Attacks   (XML tampering / High node count / Coercive parsing / Script injection / Command injection / SQL injection / SSRF / XML injection / Signature wrapping)

Carsale                      50      1000        17    1 / 3 / 2 / 3 / 2 / 2 / 1 / 3 / 0
Catalog                     100      2000        17    1 / 3 / 2 / 3 / 2 / 2 / 2 / 2 / 0
VulnShopOrder               200      2000        28    2 / 4 / 2 / 5 / 3 / 5 / 3 / 4 / 0
VulnShopAuthOrder           200      2000        78    0 / 0 / 0 / 0 / 0 / 0 / 0 / 0 / 78

6.3.1 Synthetic Datasets

The Carsale and Catalog datasets reproduce XML documents that could eventually occur in an interaction, e.g., in a RESTful resource or SOAP/WS-* message body. The documents of class Normal conform to a schema without extension points, and the schema is assumed to be unknown to the learner. Both datasets have been generated using the ToXGene tool by Barbosa et al. [35]. A ToXGene profile is similar to an XSD schema but extended by stochastic statements and examples of character data, so real-looking documents can be generated. Furthermore, the expressiveness of the underlying schema is in both cases beyond DTD because of nested elements with identical names but different types.

6.3.2 Simulated Datasets

To provide a realistic setting in the evaluation, the VulnShopService has been implemented for testing attacks without legal implications and for generating the VulnShopOrder and VulnShopAuthOrder datasets. Conceptually, the service provides a SOAP/WS-* interface for a virtual business and offers two operations, so customers can place orders:

• Operation order. A customer sends a SOAP message that contains all necessary information for ordering items, in particular, a recipient identifier, a timestamp, and one or more items and their respective quantities. Every item has a name, and some items have a barcode. The operation generates a response message notifying the client that the order has been received. Incoming messages for this operation are collected in the VulnShopOrder dataset.

• Operation authorizedOrder. If a client is trusted, the hypothetical order process can be significantly improved, e.g., by automated billing. However, the authenticity of the client and the integrity of the order need to be guaranteed. A WS-Trust policy is therefore shared between trusted clients and the service to specify the use of a digital signature in SOAP messages directed to the authorizedOrder operation. The signature verifies authenticity and integrity, the order is immediately accepted, and the response message notifies the client that the order is being processed. The VulnShopAuthOrder dataset is a collection of incoming SOAP messages with digital signatures.

The VulnShopService is programmed in Java on top of Apache Axis2 1.6.0 [21] and uses the Apache Rampart 1.6.0 [26] module for WS-Trust and WS-Security. To simulate

a realistic software development approach, the development strictly followed the Axis2 and Rampart examples: the business logic is implemented in Java classes (i.e., Java Beans), Java2WSDL automatically generates an Axis2 service and a WSDL from the Java classes, and WS-Trust settings are added to the Axis2 service configuration of the operations. It should be noted that the XSD expressiveness of Java2WSDL is restricted to simple grammars, i.e., sequence and iteration terms but no choice terms. Names for operations and Java classes have been deliberately chosen so that the expressiveness of the auto-generated types in the WSDL exceeds DTD. Furthermore, a randomized client has been implemented for issuing random orders to the service. Attacks have then been generated by manually attacking the VulnShopService and by using the WS-Attacker 1.7 [253] tool for Denial-of-Service and signature wrapping attacks. Raw SOAP messages are captured by a custom Axis2 module.

6.4 Performance Results

Using the described datasets, various aspects of performance have been evaluated, i.e., detection performance, learning progress, and operational performance. Furthermore, the applicability of unlearning and sanitization has been tested by seeding random attacks into training data during learning.

6.4.1 Detection Performance

To establish a baseline with respect to detection performance, explicit schema validation using Apache Xerces 2.9.1 [24] was performed, and the results are listed in Table 6.3. The schemas for the Carsale and Catalog datasets were extracted from the ToXGene configurations, and simple types were set to the XSD datatype string or to more specific numeric types where applicable. The VulnShopOrder and VulnShopAuthOrder datasets actually required a collection of schemas because several WS-* standards are used, in particular, message body types extracted from the auto-generated WSDL, SOAP schemas, the WS-Addressing schema, WS-Security schemas, and XML digital signature schemas.

Table 6.3: Baseline detection performance using schema validation

                       Baseline detection performance
Dataset                Pr        Re        FPR     F1

Carsale                100%      82.35%    0%      90.32%
Catalog                100%      76.47%    0%      86.67%
VulnShopOrder          100%      50%       0%      66.67%
VulnShopAuthOrder      undef.    0%        0%      undef.

The schemas in the synthetic datasets are free from extension points, and schema validation achieved high detection performance as anticipated. All attacks that manifest in the syntactical tree structure of a document were identified, and performance could be raised further if more specific datatypes were chosen in the schemas. The proposed learner does not have access to schemas and should reach the same or better performance.
The baseline detection performance of schema validation in the simulated VulnShopService painted a different picture. Half of the attacks in VulnShopOrder were identified because

of structural violations or datatype mismatches. All Denial-of-Service attacks passed validation because they appeared at extension points. Furthermore, not a single signature wrapping attack in VulnShopAuthOrder was detectable. Precision and F1 are undefined in the last case in Table 6.3 because of a division by zero.
Table 6.4 summarizes the best detection performance results of the proposed learner and validator for the lowest parameters k and l. The learner inferred a dXVPA from all normal training examples, the dXVPA was translated into a cXVPA, and a binary confusion matrix was established by counting classification results. The listed performance measures were then computed from the binary confusion matrix. The best parameters were found in a grid search over parameter values k, l ∈ {1, ..., 5} and the two state naming schemes.

Table 6.4: Detection performance highlights

                       State naming scheme    Parameters    Detection performance
Dataset                                       k     l       Pr      Re        FPR    F1

Carsale                Ancestor based         1     1       100%    100%      0%     100%
Catalog                Ancestor based         1     1       100%    82.35%    0%     90.32%
VulnShopOrder          Ancestor based         1     1       100%    92.86%    0%     96.30%
VulnShopAuthOrder      Ancestor based         1     1       100%    100%      0%     100%

The proposed language-based anomaly detection approach performed well, detection performance exceeded the baseline, and the training data was sufficient in all scenarios. No false alarms were generated, and the best results were already achieved with the simplest parameters, i.e., ancestor-based state naming and k = l = 1. Structural anomalies caused by attacks were reliably detected.
It should be stressed that k = l = 1 yielded a good-enough approximation of the language to identify attacks, but sounder types were inferred for l > 1. This difference is visible in the dXVPAs in Figure 6.1 generated for the Carsale dataset. While the dXVPA inferred with parameters k = 1, l = 1 (Figure 6.1a) created an independent module for every observed element name in the training data (DTD expressiveness), the dXVPA learned with parameters k = 1, l = 5 (Figure 6.1b) created two different modules reachable by element ad. With respect to the Carsale dataset, the type of element ad depends on its context, i.e., its ancestor newcars or usedcars, and the learner with l > 1 correctly inferred the two different modules. In general, choosing a larger parameter l > 1 did not affect detection performance but required more training and led to a better language approximation.

Undetectable Attacks. Several script and command injection attacks were not identified. All undetected attacks had in common that the exploitation code appeared in text content and used CDATA fields to hide special characters, e.g., angled brackets and ampersands, from the XML parser's lexical analysis. The Unicode representation of these attack-related characters is in the allowed XML character range, so there was no violation of the XML standard. The lexical datatype system based on XSD datatypes is very coarse for strings, and in case of the undetected attacks, the learner inferred a datatype choice whose least lexical space included the exploitation code, e.g., normalizedString. Improving the lexical datatype system by introducing more fine-grained string datatypes would improve the detection rate for these kinds of attacks.


(a) Ancestor-based states, k = 1, l = 1
(b) Ancestor-based states, k = 1, l = 5

Figure 6.1: Two dXVPAs inferred from dataset Carsale


6.4.2 Learning Progress

To understand, how quickly the learner reaches a certain level of detection performance for particular parameters, learning progress was evaluated as follows. The learner started with an intermediate VPA having only a start state and no final states yet. Then, training examples were fed to the learner in random order one by one, the learner incrementally learned, generated a cXVPA, validated all testing documents in the dataset to compute binary classification measures for the particular training iteration, and repeated this procedure. Intuitively, the detection performance stays the same or keeps increasing after every training until convergence is reached because of the strong-monotonicity property of the learner. The mind changes were measured after every incremental learning step as an observable heuristic of learning progress. Because of the random order of training examples, a run was repeated 15 times, and average values for performance measures and mind changes were computed. Figure 6.2 illustrates the fastest converging parameter settings for incremental learning of the four datasets. The final detection performances coincide with the best results in Table 6.4. In the plots, there are three measurements for each learned example: the average F1 score over all random trials, the average FPR, and the average number of mind changes. The F1 score is a combined value of detection rate and precision, and to highlight the convergence of false alarms toward zero, the FPR is also drawn in the plots. The blue and red regions with respect to F1 and FPR illustrate the minimal and maximal values, i.e., the best and worst case from randomly ordered trials. Learner parameters k = 1,l = 2 achieved the fastest convergence for all datasets. The chosen state-naming scheme for the particular experiments was irrelevant because for k = 1, both naming schemes infer congruent automata. Using these parameters, the earliest convergence with respect to detection performance was achieved after learning seven random training examples in dataset Carsale and Catalog. In datasets VulnShopOrder and VulnShopAuthOrder, the best detection performances of F1 = 96.3% and F1 = 100% was already achieved after two randomly chosen training examples. It should be stressed that the presented results are based on randomness, and there still could be orderings that lead to worse performance and slower learning progress compared to the listed experiments.

Convergence of performance measures F1 and FPR does not imply that a correct language representation has been inferred; the learned representation has only reached a good-enough approximation of the acceptable language to distinguish normal documents from documents with structural or text anomalies. Only mind changes are observable in practice, and as visible in all four figures, they become less frequent over time. Along period without mind changes is therefore a heuristic for stabilization. Figure 6.3 demonstrates the effects of increased parameters k and l on the learning progress in dataset VulnShopOrder. Both experiments reached the same detection perfor- mance after learning all training examples. However, compared to the fastest convergence in Figure 6.2c, more learning was necessary. Greater values for l and, in particular, k affect the upper combinatorial bound of possible named states, and more examples are therefore necessary to saturate the state space and transition functions in a dXVPA. More experiments for various parameters are listed in AppendixB.

131 CHAPTER 6. EXPERIMENTAL EVALUATION

(a) Carsale, anc.-sib., k = 1, l = 2
(b) Catalog, anc.-sib., k = 1, l = 2
(c) VulnShopOrder, anc.-sib., k = 1, l = 2
(d) VulnShopAuthOrder, anc.-sib., k = 1, l = 2

Figure 6.2: Learning progress highlights

6.4.3 Operational Performance

To estimate processing speed, a single large document was learned and validated, and execution times were measured to estimate throughput. The machine was an Intel Core i5-2400 3.10 GHz, and the example document was an XML dump of the DBLP database [108] (127.66 MB). The measurement process was repeated ten times to eliminate the hard disk as a bottleneck and to utilize file caching in memory. The average throughput for learning was 9.46 MBit/s (incremental VPA update, dXVPA output, and cXVPA conversion), and the average throughput for validation was 400.74 MBit/s (cXVPA validation). It should be noted that the learner implementation was not optimized, and computing the minimally required datatypes was the major bottleneck. The dXVPA minimization was, despite its worst-case time complexity, not a practical issue. Learning throughput could be significantly improved by implementing the proposed DFA_tf approach for linear-time lexical subsumption computation.


(a) Anc.-sib., k = 3, l = 3
(b) Anc.-sib., k = 5, l = 5

Figure 6.3: Effects of increased k and l for dataset VulnShopOrder

6.4.4 Unlearning

The performance of learners with and without weights is indistinguishable. However, by introducing weights for states and transitions, the learner gains additional capabilities that are beneficial for anomaly detection, i.e., unlearning and sanitization. Unlearning was evaluated as follows with respect to learning progress. A list of randomly ordered training examples was split into three parts, and the learner started by learning the first part. Then, a set of attacks was fed to the learner to simulate a successful poisoning attack, and the learner continued to learn the second part of the training examples. In the assumed scenario, the poisoning attacks were uncovered, and the learner unlearned them. Lastly, the learner continued with the last part of the training examples.
Figure 6.4 illustrates the learning progress for the datasets Carsale and VulnShopOrder under the assumed poisoning attack scenario using parameters k = 1, l = 2. Detection performance began to degrade as soon as poisoning started. Unlearning then restored the language representation. It should be noted that knowledge gained between a poisoning attack and unlearning is kept in the model and therefore not lost. Also, the order in which attacks are unlearned does not affect the final outcome. Additional experiments for the datasets Catalog and VulnShopAuthOrder are listed in Appendix B.

6.4.5 Sanitization

Intuitively, sanitization is a pruning procedure that cleans the VPA representation of low-frequency states and transitions according to the weights gathered from training so far. To evaluate sanitization, a single successful poisoning attack was injected after learning 10% of the training examples in random order. After 75% of the training examples had been learned, sanitization was performed and learning continued.
Figure 6.5 shows the learning progress under the assumed poisoning attack and sanitization scenario for the datasets Catalog and VulnShopAuthOrder, where a random attack from the test data was chosen for poisoning. In both figures, a small drop of the


(a) Carsale, anc.-sib., k = 1, l = 2
(b) VulnShopOrder, anc.-sib., k = 1, l = 2

Figure 6.4: Effects of multiple poisoning attempts and delayed unlearning

F1 score is visible because, after poisoning, the particular attack was no longer identifiable in the test data, and the detection rate declined accordingly. The violating states and transitions stayed hidden in the model until sanitization was performed. While the learner for dataset VulnShopAuthOrder immediately recovered from poisoning after sanitization, some trials on dataset Catalog showed a decline of performance after sanitization. This can be explained as follows. In at least one trial, the learner did not have a stable language representation at the moment of sanitization, good knowledge was also removed, and detection performance declined. After several iterations, the lost knowledge had been recovered in the particular trial, and the best achievable performance was restored. In the theoretical worst case, the knowledge lost through sanitization does not appear in any training example after the operation, and grammatical inference fails to converge to a good-enough approximation. Therefore, sanitization should only be performed if the learner has not changed its mind for a long time. Additional experiments for the datasets Carsale and VulnShopOrder are listed in Appendix B.


(a) Catalog, anc.-sib., k = 1, l = 2
(b) VulnShopAuthOrder, anc.-sib., k = 1, l = 2

Figure 6.5: Effects of a single poisoning attempt and later sanitization


Chapter 7

Conclusions

This chapter summarizes the contributions in Section 7.1 and analyzes the treatment of research objectives in Section 7.2. Several assumptions, restrictions, and design choices have been made, and Section 7.3 argues their purpose. The results of the experimental evaluation are discussed in detail in Section 7.4, and open questions are finally presented in Section 7.5.

7.1 Summary

The thesis addresses the problem of detecting attacks in insecure client-cloud interaction by monitoring the messages received at a particular interface. According to the LangSec threat model, many attacks exploit software vulnerabilities on both client and service side by manipulating message content such that an interpretation violates security constraints, e.g., by injecting code in some component of the receiver or by claiming higher privileges than allowed. The proposed security monitor implements a countermeasure in its components. A misuse component, which is not further specified, filters messages that contain already known attacks, e.g., identified by pattern matching. The learner and validator components implement the proposed language-based anomaly detection approach which is also the main contribution of this thesis. The learner component infers a representation of the acceptable interface language from training examples, and the validator component applies the inferred representation for syntactically validating incoming messages to identify deviations eventually caused by attacks. XML is the foundation of electronic data exchange in existing and upcoming cloud standards and therefore the subject for learning and validation. The contributions are:

1. dXVPAs and cXVPAs as event-based XML language representations by extending XVPAs with mixed content support

2. a lexical datatype system and algorithms for datatype inference from text contents in an XML document

3. an incremental and set-driven learner that infers a dXVPA from example documents

The language representations and algorithms have been implemented and experimentally evaluated in four scenarios, where state-of-the-art XML attacks were simulated.
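As a minimal illustration of the learner/validator split summarized above, the following Python sketch composes a toy learner with a membership-deciding validator. The class names and the set-based language model are assumptions for illustration only; they stand in for the dXVPA and cXVPA constructions and are not the thesis implementation.

```python
# Minimal sketch of the learner/validator components; all names are hypothetical.

class ToyLearner:
    """Infers an 'acceptable language' as a set of observed event sequences."""
    def __init__(self):
        self.model = set()

    def learn(self, event_stream):
        self.model.add(tuple(event_stream))   # stand-in for dXVPA construction

    def to_validator(self):
        return ToyValidator(frozenset(self.model))


class ToyValidator:
    """Decides membership of an incoming event stream in the learned language."""
    def __init__(self, model):
        self.model = model

    def accepts(self, event_stream) -> bool:
        return tuple(event_stream) in self.model   # stand-in for cXVPA stream validation


if __name__ == "__main__":
    learner = ToyLearner()
    learner.learn(["<order>", "<item>", "text", "</item>", "</order>"])
    validator = learner.to_validator()
    print(validator.accepts(["<order>", "<item>", "text", "</item>", "</order>"]))   # True
    print(validator.accepts(["<order>", "<script>", "x", "</script>", "</order>"]))  # False
```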


7.2 Objectives

With respect to Objectives 1 and 2, the review of today’s interaction technologies (Chapter 2) highlights the relevance of XML in modern communication protocols in a cloud context, e.g., SOAP/WS-* web services, XMPP messaging, and XML as a data serialization format in RESTful services. Several technological trends have been identified, in particular, pervasive encryption and preventive measures against deep packet inspection (e.g., certificate pinning), multiplexing in transport mechanisms, and multihoming for utilizing all available physical network connections in mobile devices. These findings challenge traditional attack detection approaches that are also used in cloud computing architectures, e.g., searching for attack patterns in network payloads. A complete view of an interaction is only possible when high-level messages at client or service interfaces are monitored, and message-level monitoring of XML protocols is therefore proposed for the following objectives.

For Objective 3, the state of the art in anomaly-based intrusion detection and XML attacks is reviewed (Chapter 3). XML-based protocols are susceptible to entire classes of security problems, and there is little research on XML anomaly detection for identifying potential attacks. XML attacks can be distinguished into parsing and semantic attacks: while parsing attacks add or modify syntactic elements in a document to cause Denial-of-Service or other violating behavior in the XML-processing component, semantic attacks target the business logic and its components, e.g., by injecting exploit code in text content or rearranging the document structure. XML-based protocols are usually specified in a schema language, and validating a message with respect to its schema is a first-line defense. However, two fundamental problems have been identified. First, XML is often wrongly treated as a tree-based data model because identity constraints make it far more expressive. Second, the industry standard XSD utilizes extension points for loose composition, and practically all XML-based protocols in cloud computing use them. Extension points enable an attacker to put arbitrary elements in a document without violating the schema. These two problems lead to sophisticated semantic attacks, e.g., the signature wrapping attack. The experimental evaluation confirms the difficulty of preventing signature wrapping attacks: traditional schema validation could not reject a single one. Related work argues that validation is still an effective countermeasure when the schema has no extension points. Eliminating extension points from protocol schemas is computationally hard. Therefore, this thesis proposes a learning-based approach.

Objective 4 comprises the major contributions for learning the acceptable language of a particular interface for language-based anomaly detection. The first contribution is an extension of the original XVPA model by Kumar et al. [223] for representing mixed-content XML, where text can occur between a start- and an end-tag, between two start-tags, between two end-tags, and between an end- and a start-tag. The proposed learner component incrementally constructs a dXVPA, where every text content is abstracted by a finite set of minimally required datatypes, i.e., a so-called datatype choice. Every dXVPA can be translated into a cXVPA for efficient stream validation by bundling datatype choices between pairs of automaton states into efficiently checkable predicates.
The second contribution concerns the abstraction of text contents by minimally required datatype choices. The proposed lexical datatype system utilizes existing XSD datatypes and specifies algorithms for determining the minimally required datatypes. Datatypes are partially ordered by lexical subsumption, and the proposed two-stage algorithm computes in the first stage an antichain of the smallest datatypes, with respect to their lexical spaces, that cover a particular text. The second stage is a preference heuristic that removes ambiguous datatypes from a datatype choice. Moreover, an optimized DFA-based approach for linear-time computation of the minimally required datatypes for a particular text is presented.

The third contribution is a set of algorithms for an incremental learner. The learner component maintains an internal automaton which can be incrementally updated, and a valid dXVPA is generated from the internal automaton. For the generalization from example documents, the learner exploits the locality of types and the simplicity of production rules usually found in schemas and ad hoc XML formats. Automaton states are named from prefixes of document event streams, and two different naming schemes are presented. Ancestor-based naming constrains the expressiveness to a subset of the XSD-expressible language class, while ancestor-sibling-based naming restricts the learnable language class to a subset of the larger 1PPT class. Learning is then achieved by state merging, parameterized by the number of local left siblings (k), the number of local ancestor elements that determine a type (l), and a particular state naming scheme. The parameters need to be chosen beforehand, and k = 1 and l = 2 have shown promising results in the experimental evaluation (Objective 5). Using XSD datatypes for datatype inference and choosing the ancestor-based state naming scheme also enable translating an inferred dXVPA into an XSD.

For applicability in an anomaly detection setting and for dealing with poisoning attacks in training data, states and transitions in the learner’s intermediate automaton are extended by weights that capture frequencies from learning, and algorithms for unlearning and sanitization are presented. Unlearning enables a weight-refined learner to forget a once-learned document, e.g., when a poisoning attack is uncovered at a later time. Sanitization trims low-frequent states and transitions from the intermediate automaton for dealing with hidden and rare poisoning attacks.

With respect to Objective 5, the proposed security monitor has been evaluated in four scenarios. The first two scenarios were synthetically generated, and the other two scenarios were recorded from a realistic VulnShopService implemented using the Apache Axis2 and Rampart frameworks. State-of-the-art attacks were simulated in all scenarios, and several aspects of the proposed monitor have been evaluated, i.e., detection performance in a binary classification setting, learning progress of the incremental learner in terms of mind changes, and operational performance by measuring learning and validation speed. The results of the evaluation are discussed in detail in Section 7.4. The architecture of the proposed security monitor has been kept abstract so far, and use cases are a middleware security component in our laboratory’s CCIM, an anomaly detection component in an XML firewall, or a client-side browser plug-in for analyzing XML-based resources. The Axis2 framework supports so-called handlers for processing SOAP messages, and the security monitor could be deployed as a service- or operation-centric handler. Furthermore, a security monitor could be deployed as an individual service; however, the treatment of detected anomalies needs to be clearly specified in this case.

7.3 Assumptions, Restrictions, and Design Choices

In the threat model based on LangSec principles, an attacker is assumed to read and modify XML-based messages in transit and to send messages directly to the observed system (Assumption 1). Furthermore, the system under observation is assumed to have type-consistent behavior (Assumption 2), i.e., for all messages that satisfy a certain acceptable type, the client or service behaves in a well-specified way. The study of actual attacks in Objective 3 supports both assumptions. A learner also assumes that the acceptable language is not random and necessarily has some structure to be discovered (Assumption 3). This assumption is a precondition for a service, and consequently for the learning approach, to make sense. Furthermore, several restrictions and design choices have been made over the chapters:

• The learner component only learns from positive examples.

• XML processing instructions, comments, and entity references are ignored.

• XML attributes are ordered.

• XSD datatypes are chosen for inferring datatype choices.

• The XVPA-based representations do not validate integrity constraints.

The most significant design choice is restricting the learning setting to positive examples because learning from both kinds of examples is generally more powerful [178, ch. 11–13]. This decision is motivated by several factors. The existing grammatical inference approaches for informed learning from both kinds of examples typically start by representing exactly the positive examples and keep on generalizing as long as no negative example is accepted [178, ch. 12]. This approach needs a representative sample of negative examples (i.e., XML attacks), but attacks are usually not available to the learner. Attacks are unpredictable, nobody knows all the possible vulnerabilities of a particular service, and the semantic XML attacks studied in Objective 3 are highly service specific. As a first step, this thesis considers learning from positive examples, where deterministic typing and assuming simple content models guide the generalization [50]. Extending the learning mechanism to also include counterexamples for greater expressiveness is an open research question and is discussed in Section 7.5. In any case, poisoning during the incremental learning process is problematic in an adversarial environment, and unlearning and sanitization operations are specified.

Document event streams have also been restricted by ignoring events for processing instructions, comments, and entity references in Definition 9. This measure has been taken because processing instructions and comments are also ignored in XSD validation [443]. Furthermore, when entity references are enabled, the parser library resolves them during syntactic analysis. But entity references pose the security risk of an entity expansion attack, and best-practice XML security recommends completely disabling inline document type declarations, including entity references [278]. Events from entity references are therefore also ignored.

Attributes in XML are typically unordered in schema languages [443, 292], but as a design choice, an order is assumed by encoding attributes as a sequence of startElement, characters, and endElement events to preserve a visibly pushdown language. Treating them like special elements was sufficient for a good-enough language approximation in the experiments. Total unorderedness of attributes in a communication protocol is unlikely because implementations serialize data structures, where the data has some order in memory. This recurrent order of attributes, despite their allowed unorderedness, has been observed in the experiments and did not affect the outcome of the evaluations. In the worst case of totally unordered attributes, more examples would be required for convergence, and exponentially many states and transitions for attributes would be introduced in a deterministic dXVPA. A better representation for attributes is an open research question.
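The attribute design choice can be illustrated with a small sketch that flattens attributes into ordinary startElement/characters/endElement events in their serialized order, so the event stream stays a word of a visibly pushdown language. The element and attribute names, and the "@"-prefix convention, are illustrative assumptions and not the thesis encoding.

```python
# Sketch of encoding attributes as events in serialization order; names are illustrative.

def element_to_events(name, attributes, text=""):
    """Turn one element with attributes into a flat event stream."""
    events = [("startElement", name)]
    for attr_name, attr_value in attributes:           # serialized attribute order
        events.append(("startElement", "@" + attr_name))
        events.append(("characters", attr_value))
        events.append(("endElement", "@" + attr_name))
    if text:
        events.append(("characters", text))
    events.append(("endElement", name))
    return events


print(element_to_events("item", [("id", "42"), ("currency", "EUR")], text="9.99"))
```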


Utilizing XSD datatypes for datatype inference is a design choice motivated by compatibility. According to Theorem 1, dXVPAs can be translated into schemas, and when a learner utilizes the ancestor-based naming scheme, an inferred dXVPA can be translated to an XSD. When XSD compatibility is not an issue, the proposed lexical datatype system is flexible and extensible. The ordering ≤lex is derived from the lexical space definitions of the provided datatypes when the learner initializes the datatype system, and the only necessary precondition is a top datatype ⊤ that accepts any Unicode string. An early stage of this research modeled character data directly in XVPA internal transitions, but this direction was a dead end. There is no knowledge about the language classes of text contents in XML, and the texts could be from languages where grammatical inference is intractable or impossible, e.g., JavaScript and natural languages. The negative effects of this expressiveness gap were observed in a very early prototype utilizing k-testable regular language inference for text contents. Therefore, the research has focused on inference of defined datatypes from texts instead.

The XVPA model was chosen as a foundation because it enables linear-time stream validation of document event streams in the SAX/StAX processing model, understanding the automaton requires only little training for a human operator, and in case of a detected anomaly, the location in the document and the causing element or text are identified. But XVPAs and the dXVPA and cXVPA extensions cannot validate XML integrity constraints, e.g., ID-based references, because they can only describe serializations of finite trees. Validation and inference of integrity constraints are not considered in this work because inferring a representation without extension points is sufficient for identifying the state-of-the-art XML attacks that modify document structure. Validating integrity constraints needs more space and time, which affects the overall complexity. Furthermore, inferring integrity constraints like XSD references, foreign keys, and uniqueness is hard because there is no observable syntactic evidence in documents, and some form of rule induction is necessary [29].
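A toy sketch of datatype inference by lexical subsumption follows. Each datatype is modeled as a predicate over the lexical space, ordered from more specific to more general, and the most specific matching datatypes form the inferred choice. The three datatypes and patterns below are simplified stand-ins; the thesis uses the full XSD datatype hierarchy, the ≤lex ordering, a preference heuristic, and a DFA-based linear-time algorithm.

```python
# Toy sketch of computing minimally required datatypes for a text; simplified assumptions.
import re

DATATYPES = [                      # from more specific to more general
    ("boolean", re.compile(r"^(true|false|0|1)$")),
    ("unsignedShort", re.compile(r"^\d{1,5}$")),
    ("string", re.compile(r"^.*$", re.DOTALL)),   # top datatype: any text
]

def minimal_datatypes(text):
    """Return the most specific datatypes whose (toy) lexical space contains the text."""
    matching = [name for name, pattern in DATATYPES if pattern.match(text)]
    # Keep an antichain: drop the top datatype whenever a more specific one matches.
    return [m for m in matching if m != "string"] or ["string"]


print(minimal_datatypes("1"))         # ['boolean', 'unsignedShort']
print(minimal_datatypes("12345"))     # ['unsignedShort']
print(minimal_datatypes("<script>"))  # ['string']
```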

7.4 Discussion of Results

The proposed learner and validator components operate on a syntactic level, and inferring a good-enough approximation of the acceptable language is sufficient for distinguishing XML attacks. With respect to Objective 5, the components were experimentally evaluated with promising results compared to baseline detection rates from traditional schema validation. The results need further discussion with respect to the following challenges in anomaly detection:

• Representativeness and quality of training data. While the synthetic datasets are representative for XML as a data serialization format, the simulated datasets were generated as realistically as possible by implementing the VulnShopService according to state-of-the-art guidelines and examples. Generating a WSDL definition automatically from an implementation is a reasonable procedure; however, the expressiveness for SOAP message bodies is limited, and convergence is already achieved after a few examples in the experiments. Nevertheless, the purpose of the simulated datasets is to provide complete SOAP messages, including WS-Addressing and WS-Security elements, as received by a particular service operation or endpoint. The proposed learner achieves convergence for all datasets, and unlearning and sanitization also allow dealing with possibly poisoned training data.

141 CHAPTER 7. CONCLUSIONS

• Unstable normality because of system updates and evolution. The interface-centric architecture of the proposed security monitor limits sensitivity to changing environments compared to, e.g., metadata-based monitoring in networks [486]. The two simulated datasets are operation centric, and the learner infers a language representation of input messages for a particular operation. In a realistic deployment, a learner for every service operation would be necessary, e.g., operations in SOAP/WS-* web services and XML-based resources in RESTful services. In case of an evolving schema, the unlearning capability would allow removal of outdated knowledge from the automaton representation. Training documents or their datatyped event streams need to be stored in this case.

• Sensitivity to false alarms. The grammatical inference approach and the proposed lexical datatype system aim at minimizing detection errors after convergence, and measuring the number of mind changes can serve as a heuristic for convergence. For the best experiments in the evaluated datasets, zero false positives were observed. All attacks that affect the acceptable structure were reliably identified. All false negatives were related to the coarseness of the lexical datatype system with respect to string-like datatypes. In particular, if the learner has inferred a string-like datatype for a particular document location, basically all Unicode characters are allowed, including the exploit code from the undetected XML- and command-injection attacks.

• Understandability by a human operator. The dXVPA and cXVPA models are intuitive for a human operator with some experience in finite-state machines. If the validator component rejects a document, the exact document position is known. Furthermore, the last active state and its outgoing transitions can explain why a document is not accepted.

• Semantic gap between anomaly and attack. The validator component only decides MEMBERSHIP and cannot reason about attack types if a document has been rejected. However, the position in the document responsible for the rejection and the allowed transitions in the last active state are valuable information for a human operator or a costly response component, e.g., policy verification; a sketch of such a rejection report is given below.
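The following sketch shows the kind of diagnostic record a rejection could carry, combining the document position, the last active state, and its allowed transitions. The field names and layout are assumptions for illustration, not the thesis data structures.

```python
# Hypothetical rejection report carrying the diagnostics discussed above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RejectionReport:
    document_position: int            # index of the offending event in the stream
    offending_event: str              # e.g., an unexpected start tag or text content
    last_active_state: str            # state of the automaton before rejection
    allowed_transitions: List[str] = field(default_factory=list)

    def explain(self) -> str:
        return (f"rejected at event {self.document_position} ({self.offending_event}); "
                f"state {self.last_active_state} allows: {', '.join(self.allowed_transitions)}")


report = RejectionReport(17, "startElement:script", "q_item",
                         allowed_transitions=["characters:decimal", "endElement:item"])
print(report.explain())
```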

7.5 Open Questions

Many issues have been addressed in this thesis, but there are still several open questions for future research opportunities.

7.5.1 Extended and Comparative Evaluation

The experimental evaluation is a proof of concept and compares the proposed algorithms against traditional schema validation as a baseline. In particular, generating datasets using the VulnShopService could be further extended by integrating more operations and more complex document structures. The client generated randomized SOAP messages using a single library for marshaling SOAP messages, and messages from different libraries would be more conclusive. Implementing the service and client in various frameworks would significantly improve the representativeness of the generated datasets. Additional scenarios for an extended evaluation would be: a RESTful service for simulating attacks, SAML-based authorization and attacks as discovered by Somorovsky et al. [394], and cross-site scripting attacks in a web application context.

Chapter 3 discusses tree kernels as a modeling approach for geometric anomaly detection in tree-structured data, and a comparative study of both approaches would be of great interest. A potential research opportunity is enriching tree kernels by lexical datatypes for text content in leaf nodes, which would require a similarity function for datatypes, and restricting subtrees to the k and l local contexts for efficient counting in the similarity computation between trees.

7.5.2 Modeling and Learning Improvements

There are still several open questions in improving the models. In the worst case, allowing attributes to be unordered in a deterministic dXVPA or cXVPA would require exponentially many states and transitions in the number of attributes, and a better representation for unordered attributes would be of use. Also, repetitions inferred by the learner component are unrestricted, and inferring length restrictions would be another potential refinement for a more concise language approximation. Weights for states and transitions are already a step in the direction of quantitative modeling, and a potential extension is probabilistic modeling. This also requires more research in the minimization of similar modules, which is currently restricted to merging congruent modules.

With respect to the lexical datatype system, there are several open research questions. A prototype for the proposed DFAtf approach for efficient datatype inference has already been implemented but not integrated yet. With respect to the modeling of datatype choices, every inferred choice has been treated independently so far, and deductions from different datatype choices for the same locations have not been made. For example, if for the same location in document w_i the learner infers the datatype choice boolean or unsignedShort, in document w_j the learner infers unsignedShort for the same location, and φ(boolean) ∩ φ(unsignedShort) = ∅, then a logical deduction is that the text content in w_i and w_j must be numeric and not of datatype boolean. Introducing such deduction rules is a potential research opportunity. Enumeration, list, and union datatypes have also not been considered yet. Furthermore, some Unicode symbols have high significance for XML attacks, e.g., angled brackets are necessary for any kind of XML injection because of their semantics in XML syntactic analysis, and the ampersand symbol typically occurs in some forms of SQL or command injection. A preliminary experiment has shown that removing the Unicode symbols <, >, and & from the allowed characters in datatype string and its subtypes would raise the F1 score to 100% in the four datasets. This experiment is not conclusive yet; however, it is an indicator for research opportunities with respect to the inference of more fine-grained string datatypes according to Unicode symbol semantics. In particular, Salama [371] investigates this direction in his master’s thesis by extending a preliminary version of the lexical datatype system with machine learning methods for inferring approximate datatypes.

A general problem for the learner is to choose the parameters k and l for finding a trade-off between overgeneralization and a slowly converging learner that needs many training examples. While k = 1 and l = 2 delivered robust results in the experiments, there could be cases where these parameters overgeneralize and allow attacks to pass through. One research opportunity would be to guide generalization by the misuse detection component, which has not been specified yet. In particular, a learned language representation should never accept an already known attack, and if this case occurs, the parameters k and l need to be adjusted. Alur and Madhusudan [10] have shown that VPLs are closed under intersection, and emptiness can be decided in PTIME. Assuming that attack signatures are expressed in terms of a dXVPA for misuse detection, deciding L_Learner ∩ L_Signatures ≠ ∅ is therefore possible and could guide generalization.
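The deduction rule discussed above can be sketched as follows: an alternative in a datatype choice may be dropped if its value space is disjoint from every alternative inferred for the same location in another document. The datatype names and the disjointness relation are simplified assumptions, not the thesis definitions.

```python
# Toy sketch of refining datatype choices across documents; simplified assumptions.

DISJOINT = {frozenset({"boolean", "unsignedShort"})}   # models φ(boolean) ∩ φ(unsignedShort) = ∅

def compatible(a, b):
    return a == b or frozenset({a, b}) not in DISJOINT

def refine(choice_a, choice_b):
    """Drop alternatives of choice_a whose value space is disjoint from all of choice_b."""
    return {a for a in choice_a if any(compatible(a, b) for b in choice_b)}


# The same location was typed {boolean, unsignedShort} in w_i and {unsignedShort} in w_j:
print(refine({"boolean", "unsignedShort"}, {"unsignedShort"}))   # {'unsignedShort'}
```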

7.5.3 Query Learning

Negative information is necessary for significantly extending the learnable language class. Security monitoring is a dynamic task, where negative information emerges over time, e.g., when an administrator identifies a previously unseen attack, and expert feedback could be considered a query learning setting. Angluin [16, 17] has defined query learning, where a learner can issue different kinds of queries to a teacher for constructing a language representation, e.g., membership and equivalence queries with respect to the acceptable language. Kumar et al. [222, 221] already present a theoretical result on query learning of modular VPAs, and extending these results to a dXVPA representation should be possible. A conceptual architecture for learning from both positive examples and queries is shown in Figure 7.1. The learner starts from a set of example documents and queries the teacher, i.e., a domain expert, to resolve structural inconsistencies in the current language representation. A VPA or dXVPA is generative, and novel examples for queries can be generated. This form of active learning is hard for discriminative approaches, which usually need access to a pool of unlabeled samples for querying. In case of an observed false negative or false positive, the teacher feeds back a counterexample to the learner, and resolving inconsistencies continues by queries.

[Figure: conceptual architecture of query learning. Example documents and teacher queries feed the learner, which produces a target automaton; the validator uses the automaton to decide whether XML documents at the interface are syntactically normal, and observed errors are fed back to the teacher.]

Figure 7.1: Learning from positive examples and queries
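A hypothetical sketch of the query loop in Figure 7.1 follows: the learner asks the teacher membership queries for generated candidate documents and only generalizes over candidates the teacher confirms. All interfaces and the set-based language model are assumptions for illustration.

```python
# Hypothetical sketch of learning from positive examples and membership queries.

class Teacher:
    """Domain expert answering whether a candidate document is acceptable."""
    def __init__(self, acceptable):
        self.acceptable = acceptable

    def membership(self, document) -> bool:
        return document in self.acceptable


def refine_with_queries(learner_language, candidates, teacher):
    """Keep only candidate generalizations the teacher confirms as acceptable."""
    confirmed = set(learner_language)
    for candidate in candidates:
        if teacher.membership(candidate):
            confirmed.add(candidate)        # safe generalization
        # otherwise: treat the candidate as a counterexample and do not generalize
    return confirmed


teacher = Teacher(acceptable={"<a><b/></a>", "<a><b/><b/></a>"})
language = refine_with_queries({"<a><b/></a>"},
                               ["<a><b/><b/></a>", "<a><script/></a>"],
                               teacher)
print(sorted(language))
```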

7.5.4 Architectural Aspects

The final open research question is about the integration of the security monitor and the treatment of detected anomalies. The learner and validator components have been kept abstract with respect to integration so far, and a specification for our laboratory’s CCIM is still open research. The proof of concept in this thesis also considers a binary classification setting, but for actual software, the treatment of an anomaly needs to be further specified. For example, actions could be notifying the receiver to delegate treatment, forwarding to an attack analysis component for risk assessment, triggering countermeasures, or automatically repairing the message to remove exploit code in the spirit of Krüger et al. [219].

Appendix A

XML Schema Datatype Hierarchy


Figure A.1: XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes [463]


Appendix B

Additional Experiments

[Figure: F1 score, FPR, attack/unlearn/sanitize markers, and mind changes (MC) over training iterations. (a) Unlearn, Catalog, anc.-sib., k = 1, l = 2; (b) Unlearn, VulnShopAuthOrder, anc.-sib., k = 1, l = 2; (c) Sanitize, Carsale, anc.-sib., k = 1, l = 2; (d) Sanitize, VulnShopOrder, anc.-sib., k = 1, l = 2]

Figure B.1: Additional unlearning and sanitization experiments


[Figure: F1 score, FPR, and mind changes (MC) over training iterations for (a) k = 1, l = 1; (b) k = 2, l = 1; (c) k = 2, l = 2; (d) k = 1, l = 3; (e) k = 2, l = 3; (f) k = 3, l = 3]

Figure B.2: Carsale learning progress using ancestor-based states

148 100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0% 40 50 30 MC 40 MC 20 30 20 10 10 0 0 0 5 15 25 35 45 0 5 15 25 35 45 Training iteration Training iteration (a) k = 1,l = 1 (b) k = 2,l = 1

100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0% 50 60 MC 40 MC 40 30 20 20 10 0 0 0 5 15 25 35 45 0 5 15 25 35 45 Training iteration Training iteration (c) k = 2,l = 2 (d) k = 1,l = 3

100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0% 80 100 60 MC 80 MC 40 60 40 20 20 0 0 0 5 15 25 35 45 0 5 15 25 35 45 Training iteration Training iteration (e) k = 2,l = 3 (f) k = 3,l = 3

Figure B.3: Carsale learning progress using ancestor-sibling-based states


[Figure: F1 score, FPR, and mind changes (MC) over training iterations; panels (a)–(f) as in Figure B.2]

Figure B.4: Catalog learning progress using ancestor-based states

[Figure: F1 score, FPR, and mind changes (MC) over training iterations; panels (a)–(f) as in Figure B.2]

Figure B.5: Catalog learning progress using ancestor-sibling-based states


[Figure: F1 score, FPR, and mind changes (MC) over training iterations; panels (a)–(f) as in Figure B.2]

Figure B.6: VulnShopOrder learning progress using ancestor-based states

152 100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0%

80 MC 100 MC 80 60 60 40 40 20 20 0 0 0 40 80 120 160 200 0 40 80 120 160 200 Training iteration Training iteration (a) k = 1,l = 1 (b) k = 2,l = 1

100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0%

100 MC 80 MC 60 50 40 20 0 0 0 40 80 120 160 200 0 40 80 120 160 200 Training iteration Training iteration (c) k = 2,l = 2 (d) k = 1,l = 3

100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0% 150 100 MC MC 100 50 50 0 0 0 40 80 120 160 200 0 40 80 120 160 200 Training iteration Training iteration (e) k = 2,l = 3 (f) k = 3,l = 3

Figure B.7: VulnShopOrder learning progress using ancestor-sibling-based states


[Figure: F1 score, FPR, and mind changes (MC) over training iterations; panels (a)–(f) as in Figure B.2]

Figure B.8: VulnShopAuthOrder learning progress using ancestor-based states

154 100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0% 250 200 MC MC 150 200 100 100 50 0 0 0 40 80 120 160 200 0 40 80 120 160 200 Training iteration Training iteration (a) k = 1,l = 1 (b) k = 2,l = 1

100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0%

300 MC MC 200 200 100 100 0 0 0 40 80 120 160 200 0 40 80 120 160 200 Training iteration Training iteration (c) k = 2,l = 2 (d) k = 1,l = 3

100% 100% 80% 80% 60% 60% F1 F1 40% FPR 40% FPR 20% 20% 0% 0%

300 MC 300 MC 200 200 100 100 0 0 0 40 80 120 160 200 0 40 80 120 160 200 Training iteration Training iteration (e) k = 2,l = 3 (f) k = 3,l = 3

Figure B.9: VulnShopAuthOrder learning progress using ancestor-sibling-based states


Bibliography

[1] van der Aalst, W.M.P., Mooij, A.J., Stahl, C., Wolf, K.: Service interaction: Patterns, formalization, and analysis. In: Formal Methods for Web Services, Lecture Notes in Computer Science, vol. 5569, pp. 42–88. Springer Berlin Heidelberg (2009)

[2] Abadi, M., Budiu, M., Erlingsson, U., Ligatti, J.: Control-flow integrity principles, im- plementations, and applications. ACM Transactions on Information and System Security 13(1), 1–40 (2009)

[3] Abdelhamid, N.: Parallel algorithms for regular expression matching in network intrusion detection systems. Master’s thesis, Johannes Kepler University Linz (2014)

[4] Aceto, G., Botta, A., de Donato, W., Pescapè, A.: Cloud monitoring: A survey. Computer Networks 57(9), 2093–2115 (2013)

[5] Adams, D.: XEP-0030: Service Discovery (2011). URL http://xmpp.org/extensions/xep-0009.html. Accessed 2014-07-23

[6] Ahonen, H.: Generating grammars for structured documents using grammatical inference methods. Tech. Rep. A-1996-4, Dept. of Computer Science, University of Helsinki, Finland (1996)

[7] Alinone, A.: 10 years of push technology, comet, and websockets (2011). URL http://cometdaily.com/2011/07/06/push-technology-comet-and-websockets-10-years-of-history-from-lightstreamers-perspective/. Accessed 2014-02-17

[8] Alonso, G., Casati, F., Kuno, H.A., Machiraj, V.: Web Services - Concepts, Architectures and Applications. Springer (2004)

[9] Alur, R., Kumar, V., Madhusudan, P., Viswanathan, M.: Congruences for visibly pushdown languages. In: Automata, Languages and Programming, ICALP’05, Lecture Notes in Computer Science, vol. 3580, pp. 1102–1114. Springer Berlin Heidelberg (2005)

[10] Alur, R., Madhusudan, P.: Visibly pushdown languages. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, STOC’04, pp. 202–211. ACM (2004)

[11] Amazon Web Services: Amazon Simple Queue Service (Amazon SQS) (2013). URL http://aws.amazon.com/sqs/. Accessed 2014-02-21

[12] Android Developers: Sensors Overview (2014). URL http://developer.android.com/guide/topics/sensors/sensors_overview.html. Accessed 2014-09-09

[13] Anglin, W.S.: The square pyramid puzzle. Am. Math. Monthly 97(2), 120–124 (1990)

[14] Angluin, D.: Inductive inference of formal languages from positive data. Information and Control 45(2), 117–135 (1980)

[15] Angluin, D.: Inference of reversible languages. Journal of the ACM 29(3), 741–765 (1982)

[16] Angluin, D.: Learning regular sets from queries and counterexamples. Information and Computation 75(2), 87–106 (1987)

[17] Angluin, D.: Queries and concept learning. Machine Learning 2(4), 319–342 (1988)

[18] Apache Commons: BCEL (2014). URL http://commons.apache.org/proper/commons-bcel/. Accessed 2014-09-10

[19] Apache Software Foundation: Apache ActiveMQ (2011). URL http://activemq.apache.org/. Accessed 2014-02-21

[20] Apache Software Foundation: OpenWire Version 2 Specification (2011). URL http://activemq.apache.org/openwire-version-2-specification.html. Accessed 2014-02-20

[21] Apache Software Foundation: Apache Axis2/Java (2012). URL http://axis.apache.org/axis2/java/core/. Accessed 2014-03-28

[22] Apache Software Foundation: Apache Thrift (2012). URL http://thrift.apache.org/. Accessed 2014-02-20

[23] Apache Software Foundation: Apache Etch (2013). URL http://etch.apache.org/. Accessed 2014-02-21

[24] Apache Software Foundation: The Apache Xerces Project (2013). URL http://xerces.apache.org. Accessed 2015-08-25

[25] Apache Software Foundation: Apache Avro 1.7.6 Specification (2014). URL http://avro.apache.org/docs/1.7.6/spec.html. Accessed 2014-02-21

[26] Apache Software Foundation: Apache Rampart (2014). URL http://axis.apache.org/axis2/java/rampart/. Accessed 2015-08-25

[27] Apache Software Foundation: Kafka (2014). URL http://kafka.apache.org/. Accessed 2014-07-15

[28] Apple: iOS: Multipath TCP Support in iOS 7 (2014). URL http://support.apple.com/kb/HT5977. Accessed 2014-05-22

[29] Arenas, M., Daenen, J., Neven, F., Ugarte, M., Bussche, J.V.D., Vansummeren, S.: Discovering XSD keys from XML data. ACM Trans. Database Syst. 39(4), 28:1–28:49 (2014)

[30] Arends, R., Austein, R., Larson, M., Massey, D., Rose, S.: DNS Security Introduction and Requirements. RFC 4033 (Proposed Standard) (2005). URL http://www.ietf.org/rfc/rfc4033.txt. Updated by RFCs 6014, 6840

[31] Ariu, D., Tronci, R., Giacinto, G.: Hmmpayl: An intrusion detection system based on hidden markov models. Computers & Security 30(4), 221–241 (2011)

[32] Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., Zaharia, M.: A view of cloud computing. Communications of the ACM 53(4), 50–58 (2010)

[33] Avižienis, A., Laprie, J.C., Randell, B.: Dependability and its threats: A taxonomy. In: Building the Information Society, IFIP International Federation for Information Processing, vol. 156, pp. 91–120. Springer US (2004)

[34] Axelsson, S.: The base-rate fallacy and the difficulty of intrusion detection. ACM Trans. Inf. Syst. Secur. 3(3), 186–205 (2000)

[35] Barbosa, D., Mendelzon, A., Keenleyside, J., Lyons, K.: ToXgene: A template-based data generator for XML. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD’02, pp. 616–616. ACM (2002)

[36] Barreno, M., Nelson, B., Sears, R., Joseph, A.D., Tygar, J.: Can machine learning be secure? In: Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS’06, pp. 16–25. ACM (2006)

[37] Barros, A., Dumas, M., ter Hofstede, A.H.: Service interaction patterns. In: Business Process Management, Lecture Notes in Computer Science, vol. 3649, pp. 302–318. Springer Berlin Heidelberg (2005)

[38] Barros, A., Dumas, M., ter Hofstede, A.H.: Service interaction patterns: Towards a reference framework for service-based business process interconnection. Tech. Rep. FIT- TR-2005-02, Faculty of IT, Queensland University of Technology (2005)

[39] Barth, A.: HTTP State Management Mechanism. RFC 6265 (Proposed Standard) (2011). URL http://www.ietf.org/rfc/rfc6265.txt

[40] Barth, A.: The Web Origin Concept. RFC 6454 (Proposed Standard) (2011). URL http://www.ietf.org/rfc/rfc6454.txt

[41] Belshe, M., Peon, R., Thomson, M.: Hypertext Transfer Protocol version 2 (Internet-Draft) (2014). URL https://http2.github.io/http2-spec/. Accessed 2014-04-08

[42] Ben-Kiki, O., Evans, C., döt Net, I.: YAML Ain’t Markup Language (YAML) Version 1.2 (2009). URL http://yaml.org/spec/1.2/spec.html. Accessed 2014-02-20

[43] Bendrath, R., Mueller, M.: The end of the net as we know it? deep packet inspection and internet governance. New Media & Society 13(7), 1142–1160 (2011) [44] Beraka, M., Mathkour, H., Gannouni, S., Hashimi, H.: Applications of different web service composition standards. In: 2012 International Conference on Cloud and Service Computing (CSC), pp. 56–63 (2012)

[45] Berners-Lee, T., Fielding, R., Masinter, L.: Uniform Resource Identifier (URI): Generic Syntax. RFC 3986 (INTERNET STANDARD) (2005). URL http://www.ietf.org/ rfc/rfc3986.txt. Updated by RFCs 6874, 7320

[46] Bernstein, D., Ludvigson, E., Sankar, K., Diamond, S., Morrow, M.: Blueprint for the intercloud - protocols and formats for cloud computing interoperability. In: 4th International Conference on Internet and Web Applications and Services, ICIW’09. IEEE (2009)

[47] Bernstein, D., Vij, D.K.: Draft Standard for Intercloud Interoperability and Federation (SIIF). Tech. Rep. P2302/D0.2, IEEE (2012). URL http://www.intercloudtestbed. org/uploads/2/1/3/9/21396364/intercloud_p2302_draft_0.2.pdf. Accessed 2015-01-16

[48] Bex, G.J., Gelade, W., Neven, F., Vansummeren, S.: Learning deterministic regular expres- sions for the inference of schemas from XML data. ACM Transactions on the Web 4(4), 1–32 (2010)

[49] Bex, G.J., Neven, F., Schwentick, T., Vansummeren, S.: Inference of concise regular expressions and DTDs. ACM Transactions on Database Systems 35(2), 1–47 (2010)

[50] Bex, G.J., Neven, F., Van den Bussche, J.: DTDs versus XML Schema: A practical study. In: Proceedings of the 7th International Workshop on the Web and Databases, WebDB’04, pp. 79–84. ACM (2004)

[51] Bex, G.J., Neven, F., Vansummeren, S.: Inferring XML schema definitions from XML data. In: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB’07, pp. 998–1009. VLDB Endowment (2007)

[52] Bex, G.J., Neven, F., Vansummeren, S.: SchemaScope: A system for inferring and cleaning XML schemas. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD’08, pp. 1259–1262. ACM (2008)

[53] Bhargavan, K., Fournet, C., Gordon, A.D.: A semantics for web services authentication. Theoretical Computer Science 340(1), 102–153 (2005) [54] Bhargavan, K., Fournet, C., Gordon, A.D., O’Shea, G.: An advisor for web services security policies. In: Proceedings of the Workshop on Secure Web Services, SWS’05, pp. 1–9. ACM (2005)

[55] Bilge, L., Dumitras, T.: Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS’12, pp. 833–844. ACM (2012)

[56] Binder, W., Hulaas, J., Moret, P.: Advanced java bytecode instrumentation. In: Proceedings of the 5th international symposium on Principles and practice of programming in Java, pp. 135–144. ACM (2007)

[57] Birrell, A.D., Nelson, B.J.: Implementing remote procedure calls. ACM Trans. Comput. Syst. 2(1), 39–59 (1984) [58] Biskup, J.: Security in Computing Systems: Challenges, Approaches and Solutions. Springer Berlin Heidelberg (2010)

[59] Bitar, N., Gringeri, S., Xia, T.: Technologies and protocols for data center and cloud networking. IEEE Communications Magazine 51(9), 24–31 (2013) [60] Boggs, N., Hiremagalore, S., Stavrou, A., Stolfo, S.J.: Cross-domain collaborative anomaly detection: So far yet so close. In: Recent Advances in Intrusion Detection, RAID’11, Lecture Notes of Computer Science, vol. 6961, pp. 142–160. Springer Berlin Heidelberg (2011)

[61] Bolzoni, D., Etalle, S., Hartel, P., Zambon, E.: Poseidon: a 2-tier anomaly-based network intrusion detection system. In: 4th IEEE International Workshop on Information Assurance, IWIA’06, pp. 144–156. IEEE (2006)

[62] Bonchi, F., Pous, D.: Hacking nondeterminism with induction and coinduction. Commun. ACM 58(2), 87–95 (2015) [63] Bonetta, D., Peternier, A., Pautasso, C., Binder, W.: S: A scripting language for high- performance restful web services. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’12, pp. 97–106. ACM (2012)

[64] Börger, E., Stärk, R.: Abstract State Machines: A Method for High-Level System Design and Analysis. Springer-Verlag New York, Inc. (2003)

[65] Bormann, C., Hoffman, P.: Concise Binary Object Representation (CBOR). RFC 7049 (Proposed Standard) (2013). URL http://www.ietf.org/rfc/rfc7049.txt

[66] Bova, T., Krivoruchka, T.: Reliable UDP Protocol (1999). URL https://datatracker.ietf.org/doc/draft-lentczner-rhttp/. Accessed 2014-28-05

[67] Braden, R.: Requirements for Internet Hosts - Communication Layers. RFC 1122 (INTER- NET STANDARD) (1989). URL http://www.ietf.org/rfc/rfc1122.txt. Updated by RFCs 1349, 4379, 5884, 6093, 6298, 6633, 6864

[68] Bratus, S., Patterson, M., Shubina, A.: The bugs we have to kill. Usenix ;login: 40(4), 4–10 (2015)

[69] Bratus, S., Darley, T., Locasto, M., Patterson, M., Shapiro, R., Shubina, A.: Beyond planted bugs in “trusting trust”: The input-processing frontier. IEEE Security & Privacy 12(1), 83–87 (2014)

[70] Bratus, S., Darley, T., Locasto, M., Patterson, M., Shapiro, R., Shubina, A.: Beyond planted bugs in “trusting trust”: The input-processing frontier. IEEE Security & Privacy 12(1), 83–87 (2014)

[71] Bray, T.: The JavaScript Object Notation (JSON) Data Interchange Format. RFC 7159 (Proposed Standard) (2014). URL http://www.ietf.org/rfc/rfc7159.txt

[72] Börger, E., Cisternino, A., Gervasi, V.: Contribution to a rigorous analysis of web applica- tion frameworks. In: Abstract State Machines, Alloy, B, VDM, and Z, Lecture Notes in Computer Science, vol. 7316, pp. 1–20. Springer Berlin Heidelberg (2012)

[73] Brüggemann-Klein, A., Wood, D.: One-unambiguous regular languages. Information and Computation 140(2), 229–253 (1998) [74] Bósa, K.: Formal modeling of mobile computing systems based on ambient abstract state machines. In: Semantics in Data and Knowledge Bases, Lecture Notes in Computer Science, vol. 7693, pp. 18–49. Springer Berlin Heidelberg (2013)

[75] Bósa, K.: An ambient ASM model of client-to-client interaction via cloud computing and an anonymously accessible docking service. In: Software Technologies, ICSOFT’13, Revised Selected Papers, Communications in Computer and Information Science, vol. 457, pp. 235–255. Springer Berlin Heidelberg (2014)

[76] Bósa, K., Holom, R.M., Vleju, M.: A formal model of client-cloud interaction. In: Correct Software in Web Applications and Web Services, Texts & Monographs in Symbolic Computation, pp. 83–144. Springer International Publishing (2015)

[77] BSON: Version 1.0 specification (2013). URL http://bsonspec.org/. Accessed 2014- 02-21

[78] Buyya, R., Ranjan, R., Calheiros, R.N.: Intercloud: Utility-oriented federation of cloud computing environments for scaling of application services. In: Algorithms and Architec- tures for Parallel Processing, Lecture Notes in Computer Science, vol. 6081, pp. 13–31. Springer Berlin Heidelberg (2010)

[79] Candle App Platform: Candle markup reference (2013). URL http://www. candlescript.org/doc/candle-markup-reference.htm. Accessed 2014-02-21

[80] Caucho Technology, Inc.: Burlap 1.0 specification (2002). URL http://hessian. caucho.com/doc/burlap-1.0-spec.xtp. Accessed 2014-02-21

[81] Caucho Technology, Inc.: Hessian binary web service protocol (2012). URL http: //hessian.caucho.com. Accessed 2014-02-21

[82] Celesti, A., Tusa, F., Villari, M., Puliafito, A.: How to enhance cloud architectures to enable cross-federation. In: 3rd International Conference on Cloud Computing, CLOUD’10, pp. 337–345. IEEE (2010)

[83] Chan-Tin, E., Heorhiadi, V., Hopper, N., Kim, Y.: The frog-boiling attack: Limitations of secure network coordinate systems. ACM Transactions on Information and System Security 14(3), 1–23 (2011)

[84] Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Computing Surveys 41(3), 1–58 (2009)

[85] Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering 24(5), 823–839 (2012)

[86] Cheng, Y., Chu, J., Radhakrishnan, S., Jain, A.: TCP Fast Open (2014). URL https: //tools.ietf.org/html/draft-ietf-tcpm-fastopen-06. Accessed 2014-05-23

[87] Cheshire, S., Krochmal, M.: DNS-Based Service Discovery. RFC 6763 (Proposed Stan- dard) (2013). URL http://www.ietf.org/rfc/rfc6763.txt

[88] Chidlovskii, B.: Schema extraction from XML: A grammatical inference approach. In: Pro- ceedings of the 8th International Workshop on Knowledge Representation meets Databases, KRDB’01 (2001)

[89] Chidlovskii, B.: Schema extraction from XML data. Tech. Rep. 2001/200, Xerox Research Center Europe (2001). URL http://www.xrce.xerox.com/Research-Development/ Publications/2001-200. Accessed 2015-06-20

[90] Cisco: Netflow (2014). URL www.cisco.com/go/netflow. Accessed 2013-10-18

[91] Cisco: 2015 annual security report (2015). URL https://www.cisco.com/web/offer/ gist_ty2_asset/Cisco_2015_ASR.pdf. Accessed 2016-04-12

[92] Clark, J.: Trang: Multi-format schema converter based on RELAX NG (2008). URL http://www.thaiopensource.com/relaxng/trang.html. Accessed 2015-04-13

[93] Cloud Security Alliance: The Notorious Nine – Cloud Computing Top Threats in 2013 (2013). URL https://downloads.cloudsecurityalliance.org/initiatives/ top_threats/The_Notorious_Nine_Cloud_Computing_Top_Threats_in_2013. pdf. Accessed 2015-01-21

[94] CloudAMQP: RabbitMQ as a Service (2014). URL http://www.cloudamqp.com. Ac- cessed 2014-07-21

[95] CloudMQTT: Hosted broker for the Internet of Things (2014). URL http://www. cloudmqtt.com. Accessed 2014-07-25

[96] Collins, M., Duffy, N.: Convolution kernels for natural language. In: Advances in Neural Information Processing Systems 14, pp. 625–632. MIT Press (2002)

[97] Comuzzi, M., Kotsokalis, C., Spanoudakis, G., Yahyapour, R.: Establishing and monitoring slas in complex service based systems. In: IEEE International Conference on Web Services, ICWS’09, pp. 783–790. IEEE (2009)

[98] Cormode, G., Krishnamurthy, B.: Key differences between web 1.0 and web 2.0. First Mon- day 13(6) (2008). URL http://firstmonday.org/ojs/index.php/fm/article/ view/2125

[99] Corona, I., Ariu, D., Giacinto, G.: HMM-Web: A framework for the detection of attacks against web applications. In: IEEE International Conference on Communications, ICC’09, pp. 1–6. IEEE (2009)

[100] Cova, M., Balzarotti, D., Felmetsger, V., Vigna, G.: Swaddler: An approach for the anomaly-based detection of state violations in web applications. In: Recent Advances in Intrusion Detection – RAID’07, Lecture Notes in Computer Science, vol. 4637, pp. 63–86. Springer Berlin Heidelberg (2007)

[101] Cretu, G.F., Stavrou, A., Locasto, M.E., Stolfo, S.J., Keromytis, A.D.: Casting out demons: Sanitizing training data for anomaly sensors. In: IEEE Symposium on Security and Privacy, S&P’08, pp. 81–95. IEEE (2008)

[102] Criscione, C., Salvaneschi, G., Maggi, F., Zanero, S.: Integrated detection of attacks against browsers, web applications and databases. In: European Conference on Computer Network Defense, EC2ND’09, pp. 37–45. IEEE (2009)

[103] Croll, A., Power, S.: Complete Web Monitoring: Watching your visitors, performance, communities, and competitors. " O’Reilly Media, Inc." (2009)

[104] Curry, E.: Message-oriented middleware. In: Q.H. Mahmoud (ed.) Middleware for Communications. John Wiley & Sons, Ltd, Chichester, UK (2005)

[105] Daboo, C.: CardDAV: vCard Extensions to Web Distributed Authoring and Versioning (WebDAV). RFC 6352 (Proposed Standard) (2011). URL http://www.ietf.org/rfc/ rfc6352.txt. Updated by RFC 6764

[106] Daboo, C., Desruisseaux, B., Dusseault, L.: Calendaring Extensions to WebDAV (CalDAV). RFC 4791 (Proposed Standard) (2007). URL http://www.ietf.org/rfc/rfc4791. txt. Updated by RFCs 5689, 6638, 6764

[107] Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves. In: Proceedings of the 23rd International Conference on Machine Learning, ICML’06, pp. 233–240. ACM (2006)

[108] DBLP: Computer Science Bibliography (2016). URL http://dblp.uni-trier.de/. Accessed 2016-04-16

[109] Debar, H., Dacier, M., Wespi, A.: Towards a taxonomy of intrusion-detection systems. Computer Networks 31(8), 805–822 (1999) [110] Deering, S., Hinden, R.: Internet Protocol, Version 6 (IPv6) Specification. RFC 2460 (Draft Standard) (1998). URL http://www.ietf.org/rfc/rfc2460.txt. Updated by RFCs 5095, 5722, 5871, 6437, 6564, 6935, 6946, 7045, 7112

[111] Delgado, N., Gates, A., Roach, S.: A taxonomy and catalog of runtime software-fault monitoring tools. IEEE Transactions on Software Engineering 30(12), 859–872 (2004) [112] Denning, D.E.: An intrusion-detection model. IEEE Transactions on Software Engineering SE-13(2), 222–232 (1987) [113] Dierks, T., Rescorla, E.: The Transport Layer Security (TLS) Protocol Version 1.2. RFC 5246 (Proposed Standard) (2008). URL http://www.ietf.org/rfc/rfc5246.txt. Updated by RFCs 5746, 5878, 6176

[114] dotCloud: ZeroRPC (2013). URL http://zerorpc.dotcloud.com/. Accessed 2014- 02-19

[115] Dubuisson, O., Fouquart, P.: ASN.1: Communication Between Heterogeneous Systems. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)

[116] Dusseault, L.: HTTP Extensions for Web Distributed Authoring and Versioning (WebDAV). RFC 4918 (Proposed Standard) (2007). URL http://www.ietf.org/rfc/rfc4918. txt. Updated by RFC 5689

[117] Düssel, P., Gehl, C., Laskov, P., Rieck, K.: Incorporation of application layer protocol syntax into anomaly detection. In: Information Systems Security – ICISS’08, Lecture Notes of Computer Science, vol. 5352, pp. 188–202. Springer Berlin Heidelberg (2008)

[118] Eddy, W.: TCP SYN Flooding Attacks and Common Mitigations. RFC 4987 (Informational) (2007). URL http://www.ietf.org/rfc/rfc4987.txt

[119] Egele, M., Scholte, T., Kirda, E., Kruegel, C.: A survey on automated dynamic malware- analysis techniques and tools. ACM Computing Surveys 44(2), 1–42 (2012) [120] Endres-Niggemeyer, B.: The mashup ecosystem. In: Semantic Mashups, pp. 1–51. Springer Berlin Heidelberg, Berlin, Heidelberg (2013)

[121] Falkenberg, A., Jensen, M., Schwenk, J.: Welcome to ws-attacks.org (2011). URL http: //www.ws-attacks.org. Accessed 2015-02-05

[122] Feng, H.H., Kolesnikov, O.M., Fogla, P., Lee, W., Gong, W.: Anomaly detection using call stack information. In: IEEE Symposium on Security and Privacy, S&P’03, pp. 62–75. IEEE (2003)

[123] Fernandes, D.A., Soares, L.F., Gomes, J.V., Freire, M.M., Inácio, P.R.: Security issues in cloud environments: a survey. International Journal of Information Security 13(2), 113–170 (2014)

[124] Fernau, H.: Learning XML grammars. In: Machine Learning and Data Mining in Pattern Recognition – MLDM’01, Lecture Notes of Computer Science, vol. 2123, pp. 73–87. Springer Berlin Heidelberg (2001)

[125] Fernau, H.: Identification of function distinguishable languages. Theoretical Computer Science 290(3), 1679–1711 (2003) [126] Fernau, H.: Algorithms for learning regular expressions from positive data. Information and Computation 207(4), 521–541 (2009) [127] Fielding, R., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. RFC 7230 (Proposed Standard) (2014). URL http://www.ietf.org/rfc/ rfc7230.txt

[128] Fielding, R., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. RFC 7231 (Proposed Standard) (2014). URL http://www.ietf.org/rfc/rfc7231. txt

[129] Fielding, R.T.: REST: Architectural styles and the design of network-based software architectures. Phd thesis, University of California, Irvine (2000)

[130] Fielding, R.T., Taylor, R.N.: Principled design of the modern web architecture. ACM Trans. Internet Technol. 2(2), 115–150 (2002) [131] Fogla, P., Sharif, M., Perdisci, R., Kolesnikov, O., Lee, W.: Polymorphic blending attacks. In: Proceedings of the 15th Conference on USENIX Security Symposium - Volume 15, USENIX-SS’06. USENIX Association (2006)

[132] Ford, A., Raiciu, C., Handley, M., Bonaventure, O.: TCP Extensions for Multipath Operation with Multiple Addresses. RFC 6824 (Experimental) (2013). URL http://www.ietf.org/rfc/rfc6824.txt

[133] Ford, B.: Structured streams: A new transport abstraction. SIGCOMM Comput. Commun. Rev. 37(4), 361–372 (2007)

[134] Forno, F., Saint-Andre, P.: XEP-0072: SOAP Over XMPP (2005). URL http://xmpp. org/extensions/xep-0072.html. Accessed 2014-07-23

[135] Forrest, S., Hofmeyr, S., Somayaji, A., Longstaff, T.: A sense of self for unix processes. In: IEEE Symposium on Security and Privacy, S&P’96, pp. 120–128. IEEE (1996)

[136] Freed, N., Borenstein, N.: Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. RFC 2045 (Draft Standard) (1996). URL http: //www.ietf.org/rfc/rfc2045.txt. Updated by RFCs 2184, 2231, 5335, 6532

[137] Freed, N., Borenstein, N.: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types. RFC 2046 (Draft Standard) (1996). URL http://www.ietf.org/rfc/rfc2046. txt. Updated by RFCs 2646, 3798, 5147, 6657

[138] Freier, A., Karlton, P., Kocher, P.: The Secure Sockets Layer (SSL) Protocol Version 3.0. RFC 6101 (Historic) (2011). URL http://www.ietf.org/rfc/rfc6101.txt

[139] Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. In: Proceedings of the 16th International Conference on Database Theory, ICDT’13, pp. 45–56. ACM (2013)

[140] Frossi, A., Maggi, F., Rizzo, G., Zanero, S.: Selecting and improving system call models for anomaly detection. In: Detection of Intrusions and Malware, and Vulnerability Assessment, DIMVA’09, Lecture Notes in Computer Science, vol. 5587, pp. 206–223. Springer Berlin Heidelberg (2009)

[141] Furuhashi, S.: MessagePack (2013). URL http://msgpack.org/. Accessed 2014-02-21

[142] Gajek, S., Jensen, M., Liao, L., Schwenk, J.: Analysis of signature wrapping attacks and countermeasures. In: International Conference on Web Services, ICWS’09, pp. 575–582. IEEE (2009)

[143] Gajek, S., Liao, L., Schwenk, J.: Breaking and fixing the inline approach. In: Proceedings of the 2007 ACM Workshop on Secure Web Services, SWS’07, pp. 37–43. ACM (2007)

[144] García, P., Vidal, E.: Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(9), 920–925 (1990)

[145] Garfinkel, T.: Traps and pitfalls: Practical problems in system call interposition based secu- rity tools. In: Proceedings of the Network and Distributed Systems Security Symposium, NDSS’03, pp. 163–176 (2003)

[146] Garfinkel, T., Rosenblum, M.: A virtual machine introspection based architecture for intrusion detection. In: Proceedings of the Network and Distributed System Security Symposium, NDSS’03 (2003)

[147] Garnock-Jones, T.: Reverse HTTP (2010). URL http://reversehttp.net/reverse-http-spec.html. Accessed 2014-03-04

[148] Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: Learning document type descriptors from XML document collections. Data Mining and Knowledge Discovery 7(1), 23–56 (2003)

[149] Garrett, J.J.: AJAX (2005). URL http://www.adaptivepath.com/ideas/ajax-new-approach-web-applications. Accessed 2013-03-27

[150] Gates, C., Taylor, C.: Challenging the anomaly detection paradigm: A provocative discussion. In: Proceedings of the 2006 Workshop on New Security Paradigms, NSPW’06, pp. 21–29. ACM (2007)

[151] Gauwin, O., Niehren, J., Roos, Y.: Streaming tree automata. Information Processing Letters 109(1), 13–17 (2008)

[152] Gerhards, R.: The Syslog Protocol. RFC 5424 (Proposed Standard) (2009). URL http: //www.ietf.org/rfc/rfc5424.txt

[153] Gerstenberger, R.: Anomaliebasierte Angriffserkennung im FTP Protokoll. Master’s thesis, University of Potsdam, Germany (2008)

[154] GlassFish Project: Open Message Queue (2014). URL https://mq.java.net/. Accessed 2014-02-19

[155] Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967)

[156] Goodloe, A., Pike, L.: Monitoring distributed real-time systems: A survey and future directions. Tech. Rep. NASA/CR-2010-216724, NASA Langley Research Center (2010). URL http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20100027427.pdf. Accessed 2015-02-21

[157] Google Developers: Protocol Buffers (2012). URL https://developers.google.com/ protocol-buffers/. Accessed 2014-02-20

[158] Google Developers: Gadgets API (2013). URL https://developers.google.com/ gadgets/. Accessed 2014-07-08

[159] Google Developers: Geolocation (2014). URL https://developers.google.com/ maps/articles/geolocation. Accessed 2014-09-09

[160] Google Developers: Google APIs Discovery Service (2014). URL https://developers. google.com/discovery/. Accessed 2014-07-18

[161] Google Developers: Google Cloud Pub/Sub (2014). URL https://developers.google. com/pubsub/overview. Accessed 2014-08-07

[162] Google Developers: Using Pull Queues in Java (2014). URL https://developers. google.com/appengine/docs/java/taskqueue/overview-pull. Accessed 2014- 08-07

[163] Google Developers: Using Push Queues in Java (2014). URL https://developers. google.com/appengine/docs/java/taskqueue/overview-push. Accessed 2014- 08-07

[164] Görnitz, N., Kloft, M., Rieck, K., Brefeld, U.: Active learning for network intrusion detection. In: Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence, AISec’09, pp. 47–54. ACM (2009)

[165] Görnitz, N., Kloft, M., Rieck, K., Brefeld, U.: Toward supervised anomaly detection. J. Artif. Int. Res. 46(1), 235–262 (2013)

[166] Gottschalk, K., Graham, S., Kreger, H., Snell, J.: Introduction to web services architecture. IBM Systems Journal 41(2), 170–177 (2002)

[167] Gregg, B.: Dtracetoolkit (2007). URL http://www.brendangregg.com/ dtracetoolkit.html. Accessed 2015-02-18

[168] Gregorio, J., de hOra, B.: The Atom Publishing Protocol. RFC 5023 (Proposed Standard) (2007). URL http://www.ietf.org/rfc/rfc5023.txt

[169] Grijzenhout, S., Marx, M.: The quality of the XML web. In: Proceedings of the 20th ACM International Conference on Information and Knowledge management, CIKM’11, pp. 1719–1724. ACM (2011)

[170] Grozev, N., Buyya, R.: Inter-cloud architectures and application brokering: taxonomy and survey. Software: Practice and Experience 44(3), 369–390 (2014)

[171] Gruschka, N., Iacono, L.: Vulnerable cloud: SOAP message security validation revisited. In: International Conference on Web Services, ICWS’09, pp. 625–631. IEEE (2009)

[172] Hadžiosmanovic,´ D., Simionato, L., Bolzoni, D., Zambon, E., Etalle, S.: N-gram against the machine: On the feasibility of the n-gram network analysis for binary protocols. In: Research in Attacks, Intrusions, and Defenses, RAID’12, Lecture Notes in Computer Science, vol. 7462, pp. 354–373. Springer Berlin Heidelberg (2012)

[173] Handley, M., Paxson, V., Kreibich, C.: Network intrusion detection: Evasion, traffic normalization, and end-to-end protocol semantics. In: Proceedings of the USENIX Security Symposium, SECURITY’01. USENIX Association (2001)

[174] Harmeling, S., Dornhege, G., Tax, D., Meinecke, F., Müller, K.R.: From outliers to prototypes: Ordering data. Neurocomputing 69(13–15), 1608 – 1618 (2006)

[175] Harrington, D., Presuhn, R., Wijnen, B.: An Architecture for Describing Simple Network Management Protocol (SNMP) Management Frameworks. RFC 3411 (INTERNET STANDARD) (2002). URL http://www.ietf.org/rfc/rfc3411.txt. Updated by RFCs 5343, 5590

[176] Hegewald, J., Naumann, F., Weis, M.: XStruct: Efficient schema extraction from multiple and large XML documents. In: 22nd International Conference on Data Engineering Workshops, ICDEW’06, pp. 81–81. IEEE (2006)

[177] de la Higuera, C.: Characteristic sets for polynomial grammatical inference. Machine Learning 138, 125–138 (1997)

[178] de la Higuera, C.: Grammatical Inference: Learning Automata and Grammars. Cambridge University Press (2010)

[179] Hildebrand, J., Millard, P., Eatmon, R., Saint-Andre, P.: XEP-0009: Jabber-RPC (2008). URL http://xmpp.org/extensions/xep-0030.html. Accessed 2014-07-23

[180] Hildebrand, J., Saint-Andre, P.: XEP-0033: Extended Stanza Addressing (2004). URL http://xmpp.org/extensions/xep-0033.html. Accessed 2014-07-23

[181] Hodges, J., Jackson, C., Barth, A.: HTTP Strict Transport Security (HSTS). RFC 6797 (Proposed Standard) (2012). URL http://www.ietf.org/rfc/rfc6797.txt

[182] Hofmeyr, S.A., Forrest, S., Somayaji, A.: Intrusion detection using sequences of system calls. Journal of Computer Security 6(3), 151–180 (1998)

[183] Hofstede, R., Drago, I., Sperotto, A., Pras, A.: Flow monitoring experiences at the ethernet-layer. In: Energy-Aware Communications – EUNICE’11, Lecture Notes in Computer Science, vol. 6955, pp. 134–145. Springer Berlin Heidelberg (2011)

[184] Hohpe, G., Woolf, B.: Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (2003)

[185] Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages and Computa- tion, Second Edition. Addison-Wesley (2000)

[186] Hornsby, A., Walsh, R.: From instant messaging to cloud computing, an xmpp review. In: IEEE 14th International Symposium on Consumer Electronics, ISCE’10, pp. 1–6. IEEE (2010)

[187] Huang, L., Joseph, A.D., Nelson, B., Rubinstein, B.I., Tygar, J.D.: Adversarial machine learning. In: Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec’11, pp. 43–58. ACM (2011)

[188] Huang, L.S., Rice, A., Ellingsen, E., Jackson, C.: Analyzing forged ssl certificates in the wild. In: IEEE Symposium on Security and Privacy, S&P’14. IEEE (2014)

[189] IBM Developer Networks: MQ Telemetry Transport (MQTT) V3.1 Protocol Specification (2010). URL http://www.ibm.com/developerworks/webservices/library/ws- /. Accessed 2014-02-20

[190] iMatix Corporation: ØMQ/0MQ (2013). URL http://www.zeromq.org/. Accessed 2014-02-19

[191] Ingham, K.L., Somayaji, A., Burge, J., Forrest, S.: Learning DFA representations of HTTP for protecting web applications. Computer Networks 51(5), 1239–1255 (2007)

[192] Internet Assigned Numbers Authority: Website (2014). URL https://www.iana.org/. Accessed 2014-07-01

[193] Ippolito, B.: Remote JSON - JSONP (2005). URL http://bob.ippoli.to/archives/ 2005/12/05/remote-json-jsonp/. Accessed 2014-07-01

[194] Iron.io: IronMQ (2014). URL http://www.iron.io/mq. Accessed 2014-02-20

[195] ISO: ISO/IEC 23001-1:2006 Information technology – MPEG systems technologies – Part 1: Binary MPEG format for XML (2006). URL http://www.iso.org/iso/iso_ catalogue/catalogue_tc/catalogue_detail.htm?csnumber=35417. Accessed 2014-06-16

[196] ITU: X.891: Information technology - Generic applications of ASN.1: Fast infoset (2005). URL http://www.itu.int/rec/T-REC-X.891-200505-I/en. Accessed 2014-06-16

[197] Jaakkola, H., Thalheim, B.: Exception-aware (information) systems. In: Information Modelling and Knowledge Bases XXIV, Frontiers in Artificial Intelligence and Applications, vol. 251, pp. 300–313. IOS Press (2013)

[198] Jain, S., Osherson, D., Royer, J.S., Sharma, A.: Systems that learn: an introduction to learning theory, second edition. Learning, development, and conceptual change. A Bradford Book, Cambridge, Mass. MIT Press (1999)

[199] Jantke, K.P.: Monotonic and non-monotonic inductive inference. New Generation Computing 8(4), 349–360 (1991)

[200] Java Community Process: JSR 173: Streaming API for XML (2013). URL https: //www.jcp.org/en/jsr/detail?id=173. Accessed 2015-02-16

[201] Jayashree, K., Anand, S.: Web service diagnoser model for managing faults in web services. Computer Standards & Interfaces 36(1), 154–164 (2013)

[202] Jensen, M., Gruschka, N., Herkenhöner, R.: A survey of attacks on web services. Computer Science - Research and Development 24(4), 185–197 (2009)

[203] Jensen, M., Liao, L., Schwenk, J.: The curse of namespaces in the domain of XML signature. In: Proceedings of the 2009 ACM Workshop on Secure Web Services, SWS’09, pp. 29–36. ACM (2009)

[204] Jensen, M., Meyer, C., Somorovsky, J., Schwenk, J.: On the effectiveness of XML schema validation for countering XML signature wrapping attacks. In: 1st International Workshop on Securing Services on the Cloud, IWSSC’11, pp. 7–13. IEEE (2011)

[205] Josefsson, S.: The Base16, Base32, and Base64 Data Encodings. RFC 4648 (Proposed Standard) (2006). URL http://www.ietf.org/rfc/rfc4648.txt

[206] Kallin, J., Valbuena, I.L.: Excess XSS, a comprehensive tutorial on cross-site scripting (2013). URL http://excess-xss.com/. Accessed 2015-03-23

[207] Khronos Group: WebGL Specification Version 1.0.2 (2013). URL https://www. khronos.org/registry/webgl/specs/1.0/. Accessed 2014-03-28

[208] Khronos Group: WebCL Specification Version 1.0 (2014). URL http://www.khronos. org/registry/webcl/specs/1.0.0/. Accessed 2014-03-28

[209] Kirchner, M.: A framework for detecting anomalies in http traffic using instance-based learning and k-nearest neighbor classification. In: 2nd International Workshop on Security and Communication Networks, IWSCN’10, pp. 1–8. IEEE (2010)

[210] Kiselyov, O., Lisovsky, K.: XML, XPath, XSLT implementations as SXML, SXPath, and SXSLT (2002). URL http://okmij.org/ftp/papers/SXs.pdf. Accessed 2014-04-25

[211] Klensin, J.: Simple Mail Transfer Protocol. RFC 5321 (Draft Standard) (2008). URL http://www.ietf.org/rfc/rfc5321.txt

[212] Ko, C., Fink, G., Levitt, K.: Automated detection of vulnerabilities in privileged programs by execution monitoring. In: 10th Annual Computer Security Applications Conference, ACSAC’94, pp. 134–144. IEEE (1994)

[213] Ko, C., Ruschitzka, M., Levitt, K.: Execution monitoring of security-critical programs in distributed systems: a specification-based approach. In: IEEE Symposium on Security and Privacy, S&P’97, pp. 175–187. IEEE (1997)

[214] Kohler, E., Handley, M., Floyd, S.: Datagram Congestion Control Protocol (DCCP). RFC 4340 (Proposed Standard) (2006). URL http://www.ietf.org/rfc/rfc4340.txt. Updated by RFCs 5595, 5596, 6335, 6773

[215] Kosala, R., Blockeel, H., Bruynooghe, M., Van den Bussche, J.: Information extraction from structured documents using k-testable tree automaton inference. Data & Knowledge Engineering 58(2), 129–158 (2006)

[216] Kozen, D.C.: Automata and Computability, 1st edn. Springer-Verlag New York, Inc., Secaucus, NJ, USA (1997)

[217] Kruegel, C., Toth, T., Kirda, E.: Service specific anomaly detection for network intrusion detection. In: Proceedings of the 2002 ACM Symposium on Applied Computing, SAC’02, pp. 201–208. ACM (2002)

[218] Kruegel, C., Vigna, G.: Anomaly detection of web-based attacks. In: Proceedings of the 10th ACM Conference on Computer and Communication Security, CCS’03, pp. 251–261. ACM (2003)

[219] Krüger, T., Gehl, C., Rieck, K., Laskov, P.: Tokdoc: A self-healing web application firewall. In: Proceedings of the 2010 ACM Symposium on Applied Computing, SAC’10, pp. 1846–1853. ACM (2010)

[220] Krüger, T., Krämer, N., Rieck, K.: ASAP: Automatic semantics-aware analysis of network payloads. In: Privacy and Security Issues in Data Mining and Machine Learning, PSDML’10, Lecture Notes of Computer Science, vol. 6549, pp. 50–63. Springer Berlin Heidelberg (2011)

[221] Kumar, V., Madhusudan, P., Viswanathan, M.: Minimization, learning, and conformance testing of boolean programs. In: CONCUR 2006 – Concurrency Theory, Lecture Notes of Computer Science, vol. 4137, pp. 203–217. Springer Berlin Heidelberg (2006)

[222] Kumar, V., Madhusudan, P., Viswanathan, M.: Minimization, learning, and conformance testing of boolean programs. Tech. Rep. UIUCDCS-R-2006-2736, University of Illinois at Urbana-Champaign (2006). URL http://hdl.handle.net/2142/11210. Accessed 10-4-2015

[223] Kumar, V., Madhusudan, P., Viswanathan, M.: Visibly pushdown automata for streaming XML. In: Proceedings of the 16th International Conference on World Wide Web, WWW’07, pp. 1053–1062. ACM (2007)

[224] Lakhina, A., Crovella, M., Diot, C.: Mining anomalies using traffic feature distributions. In: Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM’05, pp. 217–228. ACM (2005)

[225] Lam, T.C., Ding, J.J., Liu, J.C.: XML document parsing: Operational and performance characteristics. Computer 41(9), 30–37 (2008)

[226] Lampesberger, H.: Technologies for web and cloud service interaction: A survey. Service Oriented Computing and Applications (2015). DOI 10.1007/s11761-015-0174-1

[227] Lampesberger, H.: An incremental learner for language-based anomaly detection in XML. In: 2016 IEEE Security and Privacy Workshops, LangSec’16. IEEE (2016). (To appear)

[228] Lampesberger, H., Winter, P., Zeilinger, M., Hermann, E.: An on-line learning statistical model to detect malicious web requests. In: Security and Privacy in Communication Networks, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 96, pp. 19–38. Springer Berlin Heidelberg (2012)

[229] Lampesberger, H., Zeilinger, M., Hermann, E.: Statistical modeling of web requests for anomaly detection in web applications. In: Advances in IT Early Warning, pp. 91–101. Fraunhofer Verlag Stuttgart (2013)

[230] Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Grammatical Inference, Lecture Notes in Computer Science, vol. 1433, pp. 1–12. Springer Berlin Heidelberg (1998)

[231] Langley, A., Chang, W.T.: QUIC Crypto (2013). URL https://docs.google. com/document/d/1g5nIXAIkN_Y-7XJW5K45IblHd_L2f5LTaDUDwvZ5L6g/. Accessed 2014-05-26

[232] Larzon, L.A., Degermark, M., Pink, S., Jonsson, L.E., Fairhurst, G.: The Lightweight User Datagram Protocol (UDP-Lite). RFC 3828 (Proposed Standard) (2004). URL http://www.ietf.org/rfc/rfc3828.txt. Updated by RFC 6335

[233] Lazarevic, A., Kumar, V., Srivastava, J.: Intrusion detection: A survey. In: Managing Cyber Threats, Massive Computing, vol. 5, pp. 19–78. Springer US (2005)

[234] ldv_alt: Project page: strace (2013). URL http://freecode.com/projects/strace. Accessed 2013-10-18

[235] Leech, M., Ganis, M., Lee, Y., Kuris, R., Koblas, D., Jones, L.: SOCKS Protocol Version 5. RFC 1928 (Proposed Standard) (1996). URL http://www.ietf.org/rfc/rfc1928. txt

[236] Lengyel, E.: Open Data Description Language (OpenDDL). URL http://openddl.org/ (2013). Accessed 2014-02-21

[237] Leucker, M., Schallhart, C.: A brief account of runtime verification. The Journal of Logic and Algebraic Programming 78(5), 293–303 (2009)

[238] Liang, P., Jordan, M.I.: An asymptotic analysis of generative, discriminative, and pseudo-likelihood estimators. In: Proceedings of the 25th International Conference on Machine Learning, ICML’08, pp. 584–591. ACM (2008)

[239] Lindsay, J.: Web hooks to revolutionize the web (2007). URL http://progrium.com/ blog/2007/05/03/web-hooks-to-revolutionize-the-web/. Accessed 2014-02- 20

[240] Lindsay, J.: From webhooks to the evented web (2012). URL http://progrium.com/blog/2012/11/19/from-webhooks-to-the-evented-web/. Accessed 2014-07-02

[241] Lindsay, J., Shakirzyanov, B.: NullMQ (2014). URL https://github.com/progrium/ nullmq. Accessed 2014-03-10

[242] Lindsey, J.: Subtyping in W3C XML Schema, part 1. The Data Administration Newsletter (2008). URL http://www.tdan.com/view-articles/7185. Accessed 2015-03-23

[243] Lindsey, J.: Subtyping in W3C XML Schema, part 2. The Data Administration Newsletter (2008). URL http://www.tdan.com/view-articles/7186. Accessed 2015-03-23

[244] Magazinius, J.: Securing the mashed up web. Ph.D. thesis, Chalmers University of Technology, Göteborg (2013)

[245] Magazinius, J., Hedlin, D., Sabelfeld, A.: Architectures for inlining security monitors in web applications. In: International Symposium on Engineering Secure Software and Systems, ESSoS’14. Springer Berlin Heidelberg (2014)

[246] Magazinius, J., Russo, A., Sabelfeld, A.: On-the-fly inlining of dynamic security monitors. Computers & Security 31(7), 827–843 (2012)

[247] Maggi, F., Matteucci, M., Zanero, S.: Detecting intrusions through system call sequence and argument analysis. IEEE Transactions on Dependable and Secure Computing 7(4), 381–395 (2010)

[248] Maggi, F., Robertson, W., Kruegel, C., Vigna, G.: Protecting a moving target: Addressing web application concept drift. In: Recent Advances in Intrusion Detection, RAID’09, Lecture Notes of Computer Science, vol. 5758, pp. 21–40. Springer Berlin Heidelberg (2009)

[249] Maggi, F., Zanero, S.: Is the future web more insecure? Distractions and solutions of new-old security issues and measures. In: 2nd Worldwide Cybersecurity Summit, WCS’11, pp. 1–9. IEEE (2011)

[250] Mahoney, M.V.: Network traffic anomaly detection based on packet bytes. In: Proceedings of the 2003 ACM Symposium on Applied Computing, SAC’03, pp. 346–350. ACM (2003)

[251] Mahoney, M.V., Chan, P.K.: Learning nonstationary models of normal network traffic for detecting novel attacks. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’02, pp. 376–385. ACM (2002)

[252] Mainka, C., Jensen, M., Iacono, L.L., Schwenk, J.: XSpRES - robust and effective XML signatures for web services. In: Proceedings of the 2nd International Conference on Cloud Computing and Services Science, CLOSER’12, pp. 187–197. SciTePress (2012)

[253] Mainka, C., Somorovsky, J., Schwenk, J.: Penetration testing tool for web services security. In: Eighth World Congress on Services, SERVICES’12, pp. 163–170. IEEE (2012)

[254] Martens, W., Neven, F., Schwentick, T., Bex, G.J.: Expressiveness and complexity of XML Schema. ACM Transactions on Database Systems 31(3), 770–813 (2006)

[255] McIntosh, M., Austel, P.: XML signature element wrapping attacks and countermeasures. In: Proceedings of the 2005 Workshop on Secure Web Services, SWS’05, pp. 20–27. ACM (2005)

[256] Mell, P.M., Grance, T.: The NIST Definition of Cloud Computing. Tech. Rep. SP 800-145, National Institute of Standards & Technology (2011). URL http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf. Accessed 2015-01-19

[257] Menahem, E., Schclar, A., Rokach, L., Elovici, Y.: Securing your transactions: Detecting anomalous patterns in XML documents. arXiv CoRR abs/1209.1797 (2012). URL http://arxiv.org/abs/1209.1797

[258] Menahem, E., Schclar, A., Rokach, L., Elovici, Y.: XML-AD: Detecting anomalous patterns in XML documents. Information Sciences (2015). DOI 10.1016/j.ins.2015.07.007

[259] Mercer, J.: Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 209(441-458), 415–446 (1909)

[260] Michael, C.C., Ghosh, A.: Simple, state-based approaches to program-based anomaly detection. ACM Transactions on Information and System Security 5(3), 203–237 (2002)

[261] microformats wiki: microformats 2 (2014). URL http://microformats.org/wiki/microformats2. Accessed 2014-07-02

[262] Microsoft Azure: Azure Queues and Service Bus Queues - Compared and Contrasted (2014). URL http://msdn.microsoft.com/library/azure/Hh767287.aspx. Accessed 2014-07-21

[263] Microsoft Developer Network: Sending files, attachments, and SOAP messages via direct internet message encapsulation (2002). URL http://msdn.microsoft.com/en-us/magazine/cc188797.aspx. Accessed 2014-07-15

[264] Microsoft Developer Network: COM+ (Component Services) (2014). URL http: //msdn.microsoft.com/en-us/library/windows/desktop/ms685978.aspx. Ac- cessed 2014-07-08

[265] Microsoft Developer Network: Message Queuing (MSMQ) (2014). URL http://msdn. microsoft.com/en-us/library/ms711472.aspx. Accessed 2014-02-20

[266] Microsoft Developer Network: .NET Binary Format: XML Data Structure (2014). URL http://msdn.microsoft.com/en-us/library/cc219210.aspx. Accessed 2014-07- 09

[267] Microsoft Developer Network: SQL Server Service Broker (2014). URL http://msdn. microsoft.com/en-us/library/bb522893.aspx. Accessed 2014-02-23

[268] Sun Microsystems, Inc.: XDR: External Data Representation standard. RFC 1014 (1987). URL http://www.ietf.org/rfc/rfc1014.txt

[269] Millard, P., Saint-Andre, P., Meijer, R.: XEP-0060: Publish-Subscribe (2010). URL http://xmpp.org/extensions/xep-0060.html. Accessed 2014-07-23

[270] Miller, M., Saint-Andre, P.: XEP-0079: Advanced Message Processing (2005). URL http://xmpp.org/extensions/xep-0079.html. Accessed 2014-07-23

[271] Min, J.K., Ahn, J.Y., Chung, C.W.: Efficient extraction of schemas for XML documents. Information Processing Letters 85(1), 7–12 (2003)

[272] Mitchell, S.: Ask a cloud networking expert: Why is multicast disabled in the cloud? How can you re-enable UDP multicast? (2014). URL http://blog.cohesiveft.com/2014/04/ask-cloud-networking-expert-why-is.html. Accessed 2014-12-07

[273] Mlýnková, I., Nečaský, M.: Towards inference of more realistic XSDs. In: Proceedings of the 2009 ACM Symposium on Applied Computing, SAC’09, pp. 639–646. ACM (2009)

[274] Mlýnková, I., Nečaský, M.: Heuristic methods for inference of XML schemas: Lessons learned and open issues. Informatica 24(4), 577–602 (2013)

[275] Mockapetris, P.: Domain names - implementation and specification. RFC 1035 (INTERNET STANDARD) (1987). URL http://www.ietf.org/rfc/rfc1035.txt. Updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604

[276] Modi, C., Patel, D., Borisaniya, B., Patel, H., Patel, A., Rajarajan, M.: A survey of intrusion detection techniques in cloud. Journal of Network and Computer Applications 36(1), 42–57 (2013)

[277] Møller, A.: dk.brics.automaton – finite-state automata and regular expressions for Java (2010). URL http://www.brics.dk/automaton/. Accessed 2015-08-07

[278] Morgan, T.D., Ibrahim, O.A.: XML Schema, DTD, and entity attacks. Tech. rep., Virtual Security Research, LLC (2014). URL http://www.vsecurity.com/download/papers/XMLDTDEntityAttacks.pdf. Accessed 2015-03-16

[279] Morley, M.: JSON-RPC 2.0 Specification (2013). URL http://www.jsonrpc.org/ specification. Accessed 2014-02-20

[280] Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML schema languages using formal language theory. ACM Transactions on Internet Technology 5(4), 660–704 (2005)

[281] Mutz, D., Valeur, F., Vigna, G., Kruegel, C.: Anomalous system call detection. ACM Transactions on Information and System Security 9(1), 61–93 (2006)

[282] Nance, K., Bishop, M., Hay, B.: Virtual machine introspection: Observation or interference? IEEE Security & Privacy Magazine 6(5), 32–37 (2008)

[283] Natarajan, P., Amer, P.D., Stewart, R.: Multistreamed web transport for developing regions. In: Proceedings of the 2nd ACM SIGCOMM Workshop on Networked Systems for Developing Regions, NSDR’08, pp. 43–48. ACM (2008)

[284] Necula, G.C., McPeak, S., Rahul, S., Weimer, W.: CIL: Intermediate language and tools for analysis and transformation of C programs. In: Compiler Construction, Lecture Notes in Computer Science, vol. 2304, pp. 213–228. Springer Berlin Heidelberg (2002)

[285] Nethercote, N., Seward, J.: Valgrind: A framework for heavyweight dynamic binary instrumentation. SIGPLAN Not. 42(6), 89–100 (2007)

[286] Neven, F.: Automata, logic, and XML. In: Computer Science Logic, CSL’02, Lecture Notes of Computer Science, vol. 2471, pp. 671–711. Springer Berlin Heidelberg (2002)

[287] Niemi, O.P., Levomäki, A., Manner, J.: Dismantling intrusion prevention systems. ACM SIGCOMM Computer Communication Review 42(4), 285–286 (2012)

[288] Nottingham, M., Sayre, R.: The Atom Syndication Format. RFC 4287 (Proposed Standard) (2005). URL http://www.ietf.org/rfc/rfc4287.txt. Updated by RFC 5988

[289] Nusayr, A., Cook, J.: Extending AOP to support broad runtime monitoring needs. Software Engineering and Knowledge Engineering pp. 438–441 (2009)

[290] Nusayr, A., Cook, J.: Using AOP for detailed runtime monitoring instrumentation. In: Proceedings of the Seventh International Workshop on Dynamic Analysis, WODA’09, pp. 8–14. ACM (2009)

[291] Nychis, G., Sekar, V., Andersen, D.G., Kim, H., Zhang, H.: An empirical evaluation of entropy-based traffic anomaly detection. In: Proceedings of the 8th ACM SIGCOMM Conference on Internet Measurement, IMC’08, pp. 151–156. ACM (2008)

[292] OASIS: RELAX NG Specification (2001). URL https://www.oasis-open.org/ committees/relax-ng/spec.html. Accessed 2014-07-14

[293] OASIS: UDDI Version 3.0.2 (2004). URL https://www.oasis-open.org/ committees/uddi-spec/doc/spec/v3/uddi-v3.0.2-20041019.htm. Accessed 2014-02-21

[294] OASIS: OASIS Web Services Notification (WSN) TC (2006). URL https://www.oasis- open.org/committees/tc_home.php?wg_abbrev=wsn. Accessed 2014-07-14

[295] OASIS: Web Services Security: SOAP Message Security 1.1 (WS-Security 2004) (2006). URL http://docs.oasis-open.org/wss/v1.1/wss-v1.1-spec-errata- -SOAPMessageSecurity.htm. Accessed 2014-03-03

[296] OASIS: Web Services Business Process Execution Language Version 2.0 (2007). URL http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html. Ac- cessed 2014-04-25

[297] OASIS: Web Services Atomic Transaction (WS-AtomicTransaction) Version 1.2 (2009). URL http://docs.oasis-open.org/ws-tx/wstx-wsat-1.2-spec.html. Accessed 2014-04-23

[298] OASIS: Web Services Business Activity (WS-BusinessActivity) Version 1.2 (2009). URL http://docs.oasis-open.org/ws-tx/wstx-wsba-1.2-spec.html. Ac- cessed 2014-07-17

[299] OASIS: Web Services Coordination (WS-Coordination) Version 1.2 (2009). URL http: //docs.oasis-open.org/ws-tx/wstx-wscoor-1.2-spec-os.pdf. Accessed 2014- 04-23

[300] OASIS: Web Services Reliable Messaging (WS-ReliableMessaging) Version 1.2 (2009). URL http://docs.oasis-open.org/ws-rx/wsrm/v1.2/wsrm.html. Accessed 2014-07-16

[301] OASIS: WS-SecureConversation 1.4 (2009). URL http://docs.oasis-open.org/ ws-sx/ws-secureconversation/v1.4/ws-secureconversation.html. Accessed 2014-07-16

[302] OASIS: Advanced Message Queuing Protocol v1.0 (2011). URL http://docs.oasis- open.org/amqp/core/v1.0/os/amqp-core-overview-v1.0-os.html. Accessed 2014-02-19

[303] OASIS: Reference Architecture Foundation for Service Oriented Architecture Version 1.0 (2012). URL http://docs.oasis-open.org/soa-rm/soa-ra/v1.0/soa-ra.html. Accessed 2014-08-04

[304] OASIS: SAML Version 2.0 Errata 05 (2012). URL http://docs.oasis-open.org/ security/saml/v2.0/errata05/os/saml-v2.0-errata05-os.html. Accessed 2015-01-22

[305] OASIS: WS-SecurityPolicy 1.3 (2012). URL http://docs.oasis-open.org/ws-sx/ ws-securitypolicy/v1.3/ws-securitypolicy.html. Accessed 2014-07-17

[306] OASIS: WS-Trust 1.4 (2012). URL http://docs.oasis-open.org/ws-sx/ws- trust/v1.4/errata01/os/ws-trust-1.4-errata01-os-complete.html. Ac- cessed 2014-07-17

[307] OASIS: MQTT Version 3.1.1 (2014). URL http://docs.oasis-open.org/mqtt/ mqtt/v3.1.1/mqtt-v3.1.1.html. Accessed 2014-07-21

[308] Object Management Group: Documents Associated With Data Distribution Services, V1.2 (2007). URL http://www.omg.org/spec/DDS/1.2/. Accessed 2014-02-20

[309] Object Management Group: Documents Associated With The Real-Time Publish-Subscribe Wire Protocol DDS Interoperability Wire Protocol Specification (DDSI), V2.1 (2009). URL http://www.omg.org/spec/DDSI/2.1/. Accessed 2014-02-20

[310] Object Management Group: Documents Associated With CORBA, 3.3 (2012). URL http://www.omg.org/spec/CORBA/3.3/. Accessed 2014-02-20

[311] Object Management Group: IIOP: OMG’s Internet Inter-ORB Protocol, A Brief Description (2012). URL http://www.omg.org/library/iiop4.html. Accessed 2014-02-20

[312] Object Management Group: Documents Associated With BPMN Version 2.0.2 (2014). URL http://www.omg.org/spec/BPMN/2.0.2/. Accessed 2014-04-25

[313] Object Management Group: Documents Associated With DDS Security (DDS-Security) 1.0 - Beta 1 (2014). URL http://www.omg.org/spec/DDS-SECURITY/1.0/Beta1/. Accessed 2014-07-25

[314] Odersky, M., et al.: An overview of the Scala programming language. Tech. Rep. IC/2004/64, EPFL Lausanne, Switzerland (2004)

[315] Oncina, J., García, P.: Identifying regular languages in polynomial time. In: Advances in Structural and Syntactic Pattern Recognition, Series in Machine Perception and Artificial Intelligence, vol. 5, pp. 99–108. World Scientific Singapore (1992)

[316] Open Information Security Foundation: Suricata (2015). URL http://suricata-ids. org/. Accessed 2015-03-10

[317] OpenMAMA: Introduction to OpenMAMA (2014). URL http://www.openmama.org/ what-is-openmama/introduction-to-openmama. Accessed 2014-07-09

[318] OpenSocial and Gadgets Specification Group: Opensocial specification 2.5.1 (2013). URL http://opensocial.github.io/spec/2.5.1/OpenSocial-Specification. xml. Accessed 2014-07-08

[319] OpenSuSe Documentation: Understanding linux audit (2008). URL https: //www.novell.com/de-de/documentation/opensuse111/opensuse111_ security/data/cha_audit_comp.html. Accessed 2015-02-22

[320] Oracle: Oracle Tuxedo Message Queue Product Overview (2013). URL http://docs. oracle.com/cd/E35855_01/otmq/docs12c/overview/overview.html. Accessed 2014-04-28

[321] Oracle: Java Message Service (2014). URL http://www.oracle.com/technetwork/ java/jms/index.html. Accessed 2014-02-19

[322] Oracle: Java Remote Method Invocation – Distributed Computing for Java (2014). URL http://www.oracle.com/technetwork/java/javase/tech/index-jsp-138781.html. Accessed 2014-02-19

[323] Oracle: Java RMI over IIOP (2014). URL http://docs.oracle.com/javase/7/docs/ technotes/guides/rmi-iiop/index.html. Accessed 2014-07-09

[324] Osborne, J., Diquet, A.: When security gets in the way – pentesting mobile apps that use certificate pinning (2012). URL https://media.blackhat.com/bh-us-12/Turbo/ Diquet/BH_US_12_Diqut_Osborne_Mobile_Certificate_Pinning_Slides.pdf. BlackHat USA, Accessed 2014-07-29

[325] OWASP: Regular expression Denial of Service (2012). URL https://www.owasp.org/ index.php/Regular_expression_Denial_of_Service_-_ReDoS. Accessed 2015- 03-16

[326] OWASP: Certificate and Public Key Pinning (2014). URL https://www.owasp.org/ index.php/Certificate_and_Public_Key_Pinning. Accessed 2014-07-29

[327] Paoli, J.: Speed and mobility: An approach for http 2.0 to make mobile apps and the web faster (2012). URL http://blogs.msdn.com/b/interoperability/ archive/2012/03/25/speed-and-mobility-an-approach-for-http-2-0-to- make-mobile-apps-and-the-web-faster.aspx. Accessed 2014-06-04

[328] Papakonstantinou, Y., Vianu, V.: DTD inference for views of XML data. In: Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’00, pp. 35–46. ACM (2000)

[329] Paterson, I., Saint-Andre, P., Stout, L., Tilanus, W.: XEP-0206: XMPP Over BOSH (2014). URL http://xmpp.org/extensions/xep-0206.html. Accessed 2014-07-23

[330] Paterson, I., Smith, D., Saint-Andre, P., Moffitt, J.: XEP-0124: Bidirectional-streams Over Synchronous HTTP (BOSH) (2010). URL http://xmpp.org/extensions/xep- 0124.html. Accessed 2014-03-04

[331] Patterson, M.L., Bratus, S.: Language-theoretic security: Compositional correctness for the real world. In: Security ’13 BoF. USENIX (2013). URL http://langsec.org/bof- handout.pdf. Accessed 2016-04-12

[332] Pautasso, C.: Composing restful services with jopera. In: Software Composition, Lecture Notes in Computer Science, vol. 5634, pp. 142–159. Springer Berlin Heidelberg (2009)

[333] Pautasso, C.: RESTful Web service composition with BPEL for REST. Data & Knowledge Engineering 68(9), 851–866 (2009)

[334] Pautasso, C.: BPMN for REST. In: Business Process Model and Notation, Lecture Notes in Business Information Processing, vol. 95, pp. 74–87. Springer Berlin Heidelberg (2011)

[335] Pautasso, C., Wilde, E.: Why is the web loosely coupled?: A multi-faceted metric for service design. In: Proceedings of the 18th International Conference on World Wide Web, WWW’09, pp. 911–920. ACM (2009)

[336] Pautasso, C., Zimmermann, O., Leymann, F.: Restful web services vs. "big" web services: Making the right architectural decision. In: Proceedings of the 17th International Conference on World Wide Web, WWW’08, pp. 805–814. ACM (2008)

[337] Paxson, V.: Bro: A system for detecting network intruders in real-time. Computer Networks 31(23-24), 2435–2463 (1999)

[338] Perdisci, R., Ariu, D., Fogla, P., Giacinto, G., Lee, W.: McPAD: A multiple classifier system for accurate payload-based anomaly detection. Computer Networks 53(6), 864–881 (2009)

[339] Phelan, T.: Datagram Transport Layer Security (DTLS) over the Datagram Congestion Control Protocol (DCCP). RFC 5238 (Proposed Standard) (2008). URL http://www.ietf.org/rfc/rfc5238.txt

[340] Picalausa, F., Servais, F., Zimányi, E.: XEvolve: an XML schema evolution framework. In: Proceedings of the 2011 ACM Symposium on Applied Computing, SAC’11, pp. 1645–1650. ACM (2011)

[341] Plattner, B., Nievergelt, J.: Monitoring program execution: A survey. Computer 14(11), 76–93 (1981)

[342] Postel, J.: User Datagram Protocol. RFC 768 (INTERNET STANDARD) (1980). URL http://www.ietf.org/rfc/rfc768.txt

[343] Postel, J.: Internet Protocol. RFC 791 (INTERNET STANDARD) (1981). URL http: //www.ietf.org/rfc/rfc791.txt. Updated by RFCs 1349, 2474, 6864

[344] Postel, J.: Transmission Control Protocol. RFC 793 (INTERNET STANDARD) (1981). URL http://www.ietf.org/rfc/rfc793.txt. Updated by RFCs 1122, 3168, 6093, 6528

[345] Postel, J., Reynolds, J.: File Transfer Protocol. RFC 959 (INTERNET STANDARD) (1985). URL http://www.ietf.org/rfc/rfc959.txt. Updated by RFCs 2228, 2640, 2773, 3659, 5797, 7151

[346] Ptacek, T.H., Newsham, T.N.: Insertion, evasion, and denial of service: Eluding network intrusion detection. Tech. rep., Secure Networks, Inc. (1998). URL http://insecure.org/stf/secnet_ids/secnet_ids.html. Accessed 2013-10-13

[347] Rackspace: Cloud Queues (2014). URL http://docs.rackspace.com/queues/api/v1.0/cq-gettingstarted/content/DB_Overview.html. Accessed 2014-07-21

[348] Raeymaekers, S., Bruynooghe, M., den Bussche, J.: Learning (k, l)-contextual tree languages for information extraction from web pages. Machine Learning 71(2), 155–183 (2008)

[349] Rahaman, M.A., Schaad, A., Rits, M.: Towards secure SOAP message exchange in a SOA. In: Proceedings of the 3rd ACM Workshop on Secure Web Services, SWS’06, pp. 77–84. ACM (2006)

[350] Ramsdell, B., Turner, S.: Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 Message Specification. RFC 5751 (Proposed Standard) (2010). URL http://www.ietf.org/rfc/rfc5751.txt

[351] Rescorla, E.: HTTP Over TLS. RFC 2818 (Informational) (2000). URL http://www.ietf.org/rfc/rfc2818.txt. Updated by RFCs 5785, 7230

[352] Rescorla, E.: SSL and TLS: designing and building secure systems. Addison-Wesley (2001)

[353] Rescorla, E., Modadugu, N.: Datagram Transport Layer Security Version 1.2. RFC 6347 (Proposed Standard) (2012). URL http://www.ietf.org/rfc/rfc6347.txt

[354] Richters, M., Gogolla, M.: Aspect-oriented monitoring of UML and OCL constraints. In: AOSD Modeling With UML Workshop, 6th International Conference on the Unified Modeling Language (UML) (2003)

[355] Rieck, K.: Machine learning for application-layer intrusion detection. Ph.D. thesis, Berlin Institute of Technology, TU Berlin, Germany (2009)

[356] Rieck, K., Krüger, T., Brefeld, U., Müller, K.R.: Approximate tree kernels. Journal of Machine Learning Research 11, 555–580 (2010)

[357] Rieck, K., Laskov, P.: Detecting unknown network attacks using language models. In: Detection of Intrusions and Malware & Vulnerability Assessment, DIMVA’06, Lecture Notes in Computer Science, vol. 4064, pp. 74–90. Springer Berlin Heidelberg (2006)

[358] Rieck, K., Laskov, P.: Language models for detection of unknown attacks in network traffic. Journal in Computer Virology 2(4), 243–256 (2007)

[359] Rieck, K., Wahl, S., Laskov, P., Domschitz, P., Müller, K.R.: A self-learning system for detection of anomalous SIP messages. In: Principles, Systems and Applications of IP Telecommunications. Services and Security for Next Generation Networks, IPTComm’08, Lecture Notes in Computer Science, vol. 5310, pp. 90–106. Springer Berlin Heidelberg (2008)

[360] Robertson, W., Maggi, F., Kruegel, C., Vigna, G.: Effective anomaly detection with scarce training data. In: Proceedings of the Network and Distributed System Security Symposium, NDSS’10 (2010)

[361] Robertson, W., Vigna, G., Kruegel, C., Kemmerer, R.: Using generalization and characterization techniques in the anomaly-based detection of web attacks. In: Proceedings of the Network and Distributed System Security Symposium, NDSS’06 (2006)

[362] Roesch, M.: Snort - lightweight intrusion detection for networks. In: Proceedings of the 13th USENIX Conference on System Administration, LISA’99, pp. 229–238. USENIX Association (1999)

[363] Roskind, J.: QUIC: Multiplexed stream transport over UDP (2013). URL https://docs.google.com/document/d/1RNHkx_VvKWyWg6Lr8SZ-saqsQx7rFV-ev2jRFUoVD34. Accessed 2014-04-30

[364] RSS Advisory Board: RSS 2.0 Specification (2009). URL http://www.rssboard.org/ rss-specification. Accessed 2014-02-21

[365] Rubinstein, B.I., Nelson, B., Huang, L., Joseph, A.D., Lau, S.h., Rao, S., Taft, N., Tygar, J.D.: Antidote: understanding and defending against poisoning of anomaly detectors. In: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, IMC’09, pp. 1–14. ACM (2009)

[366] Russell, A.: Comet: Low latency data for the browser (2006). URL http:// infrequently.org/2006/03/comet-low-latency-data-for-the-browser/. Ac- cessed 2014-02-20

[367] Ryck, P., Decat, M., Desmet, L., Piessens, F., Joosen, W.: Security of web mashups: A survey. In: Information Security Technology for Applications, Lecture Notes in Computer Science, vol. 7127, pp. 223–238. Springer Berlin Heidelberg (2012)

[368] Sabelfeld, A., Myers, A.: Language-based information-flow security. IEEE Journal on Selected Areas in Communications 21(1), 5–19 (2003)

[369] Saint-Andre, P.: Extensible Messaging and Presence Protocol (XMPP): Core. RFC 6120 (Proposed Standard) (2011). URL http://www.ietf.org/rfc/rfc6120.txt

[370] Saint-Andre, P., Šimerda, P.: XEP-0231: Bits of Binary (2008). URL http://xmpp.org/ extensions/xep-0231.html. Accessed 2014-07-25

[371] Salama, B.S.: Approximate datatypes for anomaly detection in XML-based protocols. Master’s thesis, Johannes Kepler University Linz (2014)

[372] Salfner, F., Lenk, M., Malek, M.: A survey of online failure prediction methods. ACM Computing Surveys 42(3), 1–42 (2010)

[373] Sandhu, R., Samarati, P.: Access control: principle and practice. IEEE Communications Magazine 32(9), 40–48 (1994)

[374] Sankey, J., Wong, R.K.: Structural inference for semistructured data. In: Proceedings of the 10th International Conference on Information and Knowledge Management, CIKM’01, pp. 159–166 (2001)

[375] SAP: Message Flow Monitoring (2011). URL http://docs.oracle.com/cd/E21764_ 01/core.1111/e10043/audintro.htm. Accessed 2014-09-11

[376] Sassaman, L., Patterson, M., Bratus, S., Locasto, M.: Security applications of formal language theory. IEEE Systems Journal 7(3), 489–500 (2013)

[377] SAX Project: Simple API for XML (SAX) (2004). URL http://www.saxproject.org/. Accessed 2014-06-05

[378] Scarfone, K., Mell, P.: Guide to Intrusion Detection and Prevention Systems (IDPS). Tech. Rep. SP 800-94, National Institute of Standards & Technology (2007). URL http://csrc.nist.gov/publications/nistpubs/800-94/SP800-94.pdf. Accessed 2015-01-22

[379] Scharf, M., Ford, A.: Multipath TCP (MPTCP) Application Interface Considerations. RFC 6897 (Informational) (2013). URL http://www.ietf.org/rfc/rfc6897.txt

[380] Schewe, K.D., Bósa, K., Lampesberger, H., Ma, J., Rady, M., Vleju, M.B.: Challenges in cloud computing. Scalable Computing: Practice and Experience 12(4), 385–390 (2011)

[381] Schewe, K.D., Thalheim, B., Wang, Q.: Updates, schema updates and validation of XML documents - using abstract state machines with automata-defined states. Journal of Universal Computer Science 15(10), 2028–2057 (2009)

[382] Schneider, F.B.: Enforceable security policies. ACM Transactions on Information and System Security 3(1), 30–50 (2000)

[383] Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA (2001)

[384] Schroeder, B.: On-line monitoring: a tutorial. Computer 28(6), 72–78 (1995)

[385] Scripting News, Inc: XML-RPC.com (1999). URL http://xmlrpc.scripting.com/ default.html. Accessed 2014-02-20

[386] Segoufin, L., Vianu, V.: Validating streaming XML documents. In: Proceedings of the 21st ACM Symposium on Principles of Database Systems, PODS’02, pp. 53–64. ACM (2002)

[387] Sekar, R., Bendre, M., Dhurjati, D., Bollineni, P.: A fast automaton-based method for detecting anomalous program behaviors. In: IEEE Symposium on Security and Privacy, S&P’01, pp. 144–155. IEEE Comput. Soc (2001)

[388] Shelby, Z.: Constrained RESTful Environments (CoRE) Link Format. RFC 6690 (Proposed Standard) (2012). URL http://www.ietf.org/rfc/rfc6690.txt

[389] Shelby, Z., Hartke, K., Bormann, C.: The Constrained Application Protocol (CoAP). RFC 7252 (Proposed Standard) (2014). URL http://www.ietf.org/rfc/rfc7252.txt

[390] Software AG: Terracotta Universal Messaging (2014). URL https://www.softwareag. com/corporate/images/SAG_Terracotta_Universal_Messaging_FS_Jun14_ Web_tcm16-114090.pdf. Accessed 2014-06-04

[391] Somayaji, A., Forrest, S.: Automated response using system-call delays. In: Proceedings of the 9th USENIX Security Symposium, SECURITY’00 (2000)

[392] Sommer, R., Paxson, V.: Outside the closed world: On using machine learning for network intrusion detection. In: 2010 IEEE Symposium on Security and Privacy, S&P’10, pp. 305–316. IEEE (2010)

[393] Somorovsky, J., Heiderich, M., Jensen, M., Schwenk, J., Gruschka, N., Lo Iacono, L.: All your clouds are belong to us: Security analysis of cloud management interfaces. In: Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop, CCSW’11, pp. 3–14. ACM (2011)

[394] Somorovsky, J., Mayer, A., Schwenk, J., Kampmann, M., Jensen, M.: On breaking SAML: Be whoever you want to be. In: Proceedings of the 21st USENIX Conference on Security Symposium, Security’12, pp. 21–21. USENIX Association (2012)

[395] Song, Y., Keromytis, A., Stolfo, S.J.: Spectrogram: A mixture-of-Markov-chains model for anomaly detection in web traffic. In: Proceedings of the Network and Distributed System Security Symposium, NDSS’09 (2009)

[396] Song, Y., Locasto, M.E., Stavrou, A., Keromytis, A.D., Stolfo, S.J.: On the infeasibility of modeling polymorphic shellcode. In: Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS’07, pp. 541–551. ACM (2007)

[397] Souders, S.: Sharding dominant domains (2009). URL http://www.stevesouders. com/blog/2009/05/12/sharding-dominant-domains/. Accessed 2014-06-04

[398] Sperotto, A., Pras, A.: Flow-based intrusion detection. In: 12th IFIP/IEEE International Symposium on Integrated Network Management, IM’11, pp. 958–963. IEEE (2011)

[399] Sperotto, A., Schaffrath, G., Sadre, R., Morariu, C., Pras, A., Stiller, B.: An overview of ip flow-based intrusion detection. IEEE Communications Surveys & Tutorials 12(3), 343–356 (2010)

[400] Stephenson, D.: XML Schema best practices. Tech. rep., HP (2004). URL http:// xml.coverpages.org/HP-StephensonSchemaBestPractices.pdf. Accessed 2015- 03-25

[401] Stewart, R.: Stream Control Transmission Protocol. RFC 4960 (Proposed Standard) (2007). URL http://www.ietf.org/rfc/rfc4960.txt. Updated by RFCs 6096, 6335, 7053

[402] STOMP: The Simple Text Oriented Messaging Protocol v1.2 (2012). URL http://stomp. github.io/stomp-specification-1.2.html. Accessed 2014-02-19

[403] StormMQ Limited: Message queues as a service in the cloud (2013). URL http://stormmq.com/. Accessed 2014-02-21

[404] Stout, L., Moffitt, J., Cestari, E.: An XMPP Sub-protocol for WebSocket (2014). URL https://datatracker.ietf.org/doc/draft-ietf-xmpp-websocket/. Accessed 2014-06-16

[405] Tax, D.M.J., Duin, R.P.W.: Support vector data description. Machine Learning 54(1), 45–66 (2004)

[406] Tellenbach, B., Burkhart, M., Sornette, D., Maillart, T.: Beyond Shannon: Characterizing internet traffic with generalized entropy metrics. In: Passive and Active Network Measurement, PAM’09, Lecture Notes in Computer Science, vol. 5448, pp. 239–248. Springer Berlin Heidelberg (2009)

[407] The Apache Software Foundation: Apache module mod_proxy (2013). URL http: //httpd.apache.org/docs/2.0/mod/mod_proxy.html. Accessed 2013-11-18

[408] The Chromium Blog: New chromium security features, june 2011 (2011). URL http://blog.chromium.org/2011/06/new-chromium-security-features- june.html. Accessed 2014-07-29

[409] The Chromium Projects: SPDY: An experimental protocol for a faster web. URL http: //dev.chromium.org/spdy/spdy-whitepaper. Accessed 2014-03-05

[410] The Network Encyclopedia: circuit level gateway (2013). URL http://www. thenetworkencyclopedia.com/entry/circuit-level-gateway/. Accessed 2014- 09-15

[411] The Open Group: CAE Specification, DCE 1.1: Remote Procedure Call (1997). URL http://pubs.opengroup.org/onlinepubs/9629399/. Accessed 2014-07-09

[412] Thomo, A., Venkatesh, S., Ye, Y.: Visibly pushdown transducers for approximate validation of streaming XML. In: Foundations of Information and Knowledge Systems, Lecture Notes in Computer Science, vol. 4932, pp. 219–238. Springer Berlin Heidelberg (2008)

[413] Thottan, M., Ji, C.: Anomaly detection in ip networks. IEEE Transactions on Signal Processing 51(8), 2191–2204 (2003)

[414] Thurlow, R.: RPC: Remote Procedure Call Protocol Specification Version 2. RFC 5531 (Draft Standard) (2009). URL http://www.ietf.org/rfc/rfc5531.txt

[415] TIBCO: Enterprise Message Service (2014). URL http://www.tibco.com/ products/automation/enterprise-messaging/enterprise-message- service/default.jsp. Accessed 2014-07-21

[416] TIBCO: Rendezvous Messaging Middleware (2014). URL http://www.tibco.com/ products/automation/enterprise-messaging/rendezvous/default.jsp. Ac- cessed 2014-07-22

[417] Toosi, A.N., Calheiros, R.N., Buyya, R.: Interconnected cloud computing environments: Challenges, taxonomy, and survey. ACM Comput. Surv. 47(1), 7:1–7:47 (2014)

[418] TrustedBSD Project: OpenBSM: Open source basic security module (BSM) audit implementation (2009). URL http://www.trustedbsd.org/openbsm.html. Accessed 2015-02-18

[419] Tuexen, M., Seggelmann, R., Rescorla, E.: Datagram Transport Layer Security (DTLS) for Stream Control Transmission Protocol (SCTP). RFC 6083 (Proposed Standard) (2011). URL http://www.ietf.org/rfc/rfc6083.txt

[420] Twitter, Inc.: finagle (2014). URL http://twitter.github.io/finagle/guide/. Accessed 2014-07-15

[421] Unicode, Inc.: What is unicode? (2013). URL http://www.unicode.org/standard/ WhatIsUnicode.html. Accessed 2014-06-16

[422] Valdes, A., Skinner, K.: Adaptive, model-based monitoring for cyber attack detection. In: Recent Advances in Intrusion Detection, RAID’00, Lecture Notes in Computer Science, vol. 1907, pp. 80–93. Springer Berlin Heidelberg (2000)

[423] Čeleda, P., Krmíček, V.: Flow data collection in large scale networks. In: Advances in IT Early Warning, pp. 30–40. Fraunhofer Verlag (2013)

[424] Veen, R.: The OGDL Specification (2014). URL http://ogdl.org/spec/. Accessed 2014-02-21

[425] Vidal, E., Llorens, D.: Using knowledge to improve n-gram language modelling through the MGGI methodology. In: Grammatical Inference: Learning Syntax from Sentences, Lecture Notes in Computer Science, vol. 1147, pp. 179–190. Springer Berlin Heidelberg (1996)

[426] Vlist, E.v.d.: RELAX NG. O’Reilly Media, Inc. (2003)

[427] Vogels, W.: Web services are not distributed objects. IEEE Internet Computing 7(6), 59–66 (2003)

[428] Vorobiev, A., Han, J.: Security attack ontology for web services. In: Second International Conference on Semantics, Knowledge and Grid, SKG’06, pp. 42–47. IEEE (2006)

[429] W3C: HTTP-NG Binary Wire Protocol (1998). URL http://www.w3.org/TR/WD-HTTP- NG-wire/. Accessed 2014-06-04

[430] W3C: SMUX Protocol Specification (1998). URL http://www.w3.org/TR/WD-mux. Accessed 2014-06-04

[431] W3C: HTML 4.01 Specification (1999). URL http://www.w3.org/TR/html4/. Ac- cessed 2014-02-20

[432] W3C: XML Path Language (XPath) 1.0 (1999). URL http://www.w3.org/TR/xpath/. Accessed 2015-03-25

[433] W3C: Web Services Description Language (WSDL) 1.1 (2001). URL http://www.w3. org/TR/wsdl. Accessed 2014-02-21

[434] W3C: XHTML 1.0 The Extensible HyperText Markup Language (Second Edition) (2002). URL http://www.w3.org/TR/xhtml1/. Accessed 2014-02-20

[435] W3C: XML Encryption Syntax and Processing (2002). URL http://www.w3.org/TR/ xmlenc-core/. Accessed 2015-03-26

[436] W3C: XML Pointer Language (XPointer) (2002). URL http://www.w3.org/TR/xptr/. Accessed 2015-03-31

[437] W3C: XML-Signature XPath Filter 2.0 (2002). URL http://www.w3.org/TR/xmldsig- filter2/. Accessed 2015-04-02

[438] W3C: Document Object Model (DOM) Level 3 Core Specification (2004). URL http: //www.w3.org/TR/DOM-Level-3-Core/. Accessed 2014-02-17

[439] W3C: RDF/XML Syntax Specification (Revised) (2004). URL http://www.w3.org/TR/ rdf-syntax-grammar/. Accessed 2014-02-20

[440] W3C: SOAP 1.2 Attachment Feature (2004). URL http://www.w3.org/TR/soap12- af/. Accessed 2014-07-15

[441] W3C: Web Services Addressing (WS-Addressing) (2004). URL http://www.w3.org/ Submission/ws-addressing/. Accessed 2014-03-03

[442] W3C: XML Information Set (Second Edition) (2004). URL http://www.w3.org/TR/ xml-infoset/. Accessed 2014-08-02

[443] W3C: XML Schema Part 0: Primer Second Edition (2004). URL http://www.w3.org/ TR/xmlschema-0/. Accessed 2014-07-14

[444] W3C: Describing Media Content of Binary Data in XML (2005). URL http://www.w3. org/TR/xml-media-types/. Accessed 2014-07-15

[445] W3C: SOAP Message Transmission Optimization Mechanism (2005). URL http://www. w3.org/TR/soap12-mtom/. Accessed 2014-02-20

[446] W3C: Web Services Choreography Description Language Version 1.0 (2005). URL http://www.w3.org/TR/ws-cdl-10/. Accessed 2014-04-25

[447] W3C: SOAP Version 1.2 Part 1: Messaging Framework (Second Edition) (2007). URL http://www.w3.org/TR/soap12-part1/. Accessed 2014-02-20

[448] W3C: Web Services Policy 1.5 - Primer (2007). URL http://www.w3.org/TR/ws-policy-primer/. Accessed 2014-07-17

[449] W3C: XQuery 1.0 and XPath 2.0 Data Model (XDM) (Second Edition) (2007). URL http://www.w3.org/TR/xpath-datamodel/. Accessed 2015-03-25

[450] W3C: XSL Transformations (XSLT) Version 2.0 (2007). URL http://www.w3.org/TR/ xslt20/. Accessed 2015-03-25

[451] W3C: Extensible Markup Language (XML) 1.0 (Fifth Edition) (2008). URL http://www. w3.org/TR/2008/REC-xml-20081126/. Accessed 2014-02-17

[452] W3C: XML-binary Optimized Packaging (2008). URL http://www.w3.org/TR/ xop10/. Accessed 2014-02-20

[453] W3C: XML Signature Syntax and Processing (Second Edition) (2008). URL http://www.w3.org/TR/xmldsig-core/. Accessed 2015-01-23

[454] W3C: Namespaces in XML 1.0 (Third Edition) (2009). URL http://www.w3.org/TR/ xml-names/. Accessed 2015-03-31

[455] W3C: Web Application Description Language (2009). URL http://www.w3.org/ Submission/wadl/. Accessed 2014-03-28

[456] W3C: XML Path Language (XPath) 2.0 (2010). URL http://www.w3.org/TR/ xpath20/. Accessed 2015-03-25

[457] W3C: XQuery 1.0: An XML Query Language (Second Edition) (2010). URL http: //www.w3.org/TR/xquery/. Accessed 2015-03-25

[458] W3C: Web Services Eventing (WS-Eventing) (2011). URL http://www.w3.org/TR/ws- eventing/. Accessed 2014-07-14

[459] W3C: Content Security Policy 1.0 (2012). URL http://www.w3.org/TR/CSP/. Accessed 2014-07-02

[460] W3C: OWL 2 Web Ontology Language Document Overview (Second Edition) (2012). URL http://www.w3.org/TR/owl2-overview/. Accessed 2014-07-02

[461] W3C: Server-Sent Events (2012). URL http://www.w3.org/TR/eventsource/. Ac- cessed 2014-06-03

[462] W3C: The WebSocket API (2012). URL http://www.w3.org/TR/websockets/. Ac- cessed 2014-02-21

[463] W3C: W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes (2012). URL http://www.w3.org/TR/xmlschema11-2/. Accessed 2015-04-02

[464] W3C: HTML Microdata (2013). URL http://www.w3.org/TR/microdata/. Accessed 2014-07-02

[465] W3C: RDFa Core 1.1 - Second Edition (2013). URL http://www.w3.org/TR/rdfa- core/. Accessed 2014-07-02

[466] W3C: Semantic Web (2013). URL http://www.w3.org/standards/semanticweb/. Accessed 2014-06-30

[467] W3C: WebRTC 1.0: Real-time Communication Between Browsers (2013). URL http: //www.w3.org/TR/webrtc/. Accessed 2014-02-21

[468] W3C: A Little History of the World Wide Web (2014). URL http://www.w3.org/History.html. Accessed 2014-12-05

[469] W3C: Cross-Origin Resource Sharing (2014). URL http://www.w3.org/TR/cors/. Accessed 2014-07-01

[470] W3C: Efficient XML Interchange (EXI) Format 1.0 (Second Edition) (2014). URL http: //www.w3.org/TR/exi/. Accessed 2014-06-18

[471] W3C: HTML5 (2014). URL http://www.w3.org/TR/html5/. Accessed 2014-02-17

[472] W3C: JSON-LD 1.0 (2014). URL http://www.w3.org/TR/json-ld/. Accessed 2014- 07-02

[473] W3C: Polyglot Markup: A robust profile of the HTML5 vocabulary (2014). URL http: //www.w3.org/TR/html-polyglot/. Accessed 2015-02-22

[474] W3C: XML Path Language (XPath) 3.0 (2014). URL http://www.w3.org/TR/xpath- 30/. Accessed 2015-03-25

[475] Wagner, A., Plattner, B.: Entropy based worm and anomaly detection in fast IP networks. In: 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise, WETICE’05, pp. 172–177. IEEE (2005)

[476] Wagner, D., Dean, R.: Intrusion detection via static analysis. In: IEEE Symposium on Security and Privacy, S&P’01, pp. 156–168. IEEE Comput. Soc (2001)

[477] Wagner, D., Soto, P.: Mimicry attacks on host-based intrusion detection systems. In: Proceedings of the 9th ACM Conference on Computer and Communications Security, CCS’02, pp. 255–264. ACM (2002)

[478] Wang, J., Bigham, J.: Anomaly detection in the case of message oriented middleware. In: Proceedings of the 2008 Workshop on Middleware Security, MidSec’08, pp. 40–42. ACM (2008)

[479] Wang, K., Parekh, J., Stolfo, S.J.: Anagram: A content anomaly detector resistant to mimicry attack. In: Recent Advances in Intrusion Detection – RAID’06, Lecture Notes of Computer Science, vol. 4219, pp. 226–248. Springer Berlin Heidelberg (2006)

[480] Wang, K., Stolfo, S.J.: Anomalous payload-based network intrusion detection. In: Recent Advances in Intrusion Detection – RAID’04, Lecture Notes of Computer Science, vol. 3224, pp. 203–222. Springer Berlin Heidelberg (2004)

[481] Web Services Interoperability Organization: Basic Profile Version 1.2 (2010). URL http: //ws-i.org/Profiles/BasicProfile-1.2-2010-11-09.html. Accessed 2014-05- 05

[482] WebSphere Software: Introduction to Oracle Fusion Middleware Audit Framework (2011). URL http://docs.oracle.com/cd/E21764_01/core.1111/e10043/audintro.htm. Accessed 2014-09-11

[483] WebSphere Software: Using WebSphere Message Broker log and trace files (2014). URL http://publib.boulder.ibm.com/infocenter/wtxdoc/v8r2m0/index.jsp? topic=/com.ibm.websphere.dtx.wtx4wmb.doc/references/r_wtx4wmb_using_ wmb_log_and_trace_files.htm. Accessed 2014-09-11

[484] White, J.: High-level framework for network-based resource sharing. RFC 707 (1975). URL http://www.ietf.org/rfc/rfc707.txt

[485] Wikidot.com: RestMS (2008). URL http://www.restms.org/. Accessed 2014-02-21

[486] Winter, P., Lampesberger, H., Zeilinger, M., Hermann, E.: On detecting abrupt changes in network entropy time series. In: Communications and Multimedia Security, CMS'11, Lecture Notes in Computer Science, vol. 7025, pp. 194–205. Springer Berlin Heidelberg (2011)

[487] Wojtczuk, R.: Libnids (2010). URL http://libnids.sourceforge.net/. Accessed 2013-11-01

[488] Wressnegger, C., Schwenk, G., Arp, D., Rieck, K.: A close look on n-grams in intrusion detection: Anomaly detection vs. classification. In: Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, AISec'13, pp. 67–76. ACM (2013)

[489] Wu, H., Ling, T., Dobbie, G., Bao, Z., Xu, L.: Reducing graph matching to tree matching for XML queries with ID references. In: Database and Expert Systems Applications, Lecture Notes in Computer Science, vol. 6262, pp. 391–406. Springer Berlin Heidelberg (2010). DOI 10.1007/978-3-642-15251-1_31

[490] Xie, Y., Yu, S.Z.: A dynamic anomaly detection model for web user behavior based on HsMM. In: 10th International Conference on Computer Supported Cooperative Work in Design, CSCWD’06, pp. 1–6. IEEE (2006)

[491] Xie, Y., Yu, S.Z.: A large-scale hidden semi-Markov model for anomaly detection on user browsing behaviors. IEEE/ACM Transactions on Networking 17(1), 54–65 (2009)

[492] XimpleWare: VTD-XML: The future of XML processing (2012). URL http://vtd-xml.sourceforge.net/. Accessed 2015-02-16

[493] XML-RPC Description Language: Getting started with xrdl (2010). URL https://code.google.com/p/xrdl/wiki/GettingStarted. Accessed 2014-07-09

[494] Yergeau, F.: UTF-8, a transformation format of ISO 10646. RFC 3629 (INTERNET STANDARD) (2003). URL http://www.ietf.org/rfc/rfc3629.txt

[495] Yu, F., Diao, Y., Katz, R.H., Lakshman, T.V.: Fast packet pattern-matching algorithms. In: Algorithms for Next Generation Networks, Computer Communications and Networks, pp. 219–238. Springer London (2010)

[496] Zaha, J., Dumas, M., ter Hofstede, A., Barros, A., Decker, G.: Service interaction modeling: Bridging global and local views. In: 10th IEEE International Enterprise Distributed Object Computing Conference, EDOC’06, pp. 45–55 (2006)

[497] Zanero, S., Savaresi, S.M.: Unsupervised learning techniques for an intrusion detection system. In: Proceedings of the 2004 ACM Symposium on Applied Computing, SAC’04, pp. 412–419. ACM (2004)

[498] ZeroC, Inc.: The Internet Communications Engine (ICE) (2013). URL http://www.zeroc.com/ice.html. Accessed 2014-02-21

[499] Zimmermann, H.: OSI reference model – the ISO model of architecture for open systems interconnection. IEEE Transactions on Communications 28(4), 425–432 (1980)

Curriculum Vitae

Personal Information

Name            Harald Lampesberger
Date of birth   August 12th, 1983
Place of birth  Amstetten, Austria
Nationality     Austria

Education

2011–2016   Doctoral Studies in Technical Sciences, Christian Doppler Laboratory for Client-Centric Cloud Computing, Johannes Kepler University Linz
2008–2010   Master Studies in Secure Information Systems, Upper Austria University of Applied Sciences, Hagenberg
2003–2006   Bachelor Studies in Computer and Media Security, Upper Austria University of Applied Sciences, Hagenberg
1996–2002   Secondary School with focus on Automation Technology, Higher Technical Institute, Waidhofen/Ybbs

Professional Experience

2015–       Assistant professor at Upper Austria University of Applied Sciences, Hagenberg
2011–2015   Research fellow at Christian Doppler Laboratory for Client-Centric Cloud Computing, Hagenberg
2010–2015   Visiting lecturer at Upper Austria University of Applied Sciences, Hagenberg
2009–2011   Research fellow at FH OÖ Forschungs & Entwicklungs GmbH, Hagenberg
2006–2009   Software developer at underground_8 secure computing GmbH, Linz
2005–2008   Self-employed software developer and hosting services
2006–2006   Internship at Siemens AG, CT IC CERT, Munich
2002–2002   IT technician at SCL Schmid GmbH, Spiegelsberg

Publications

Lampesberger, H.: An incremental learner for language-based anomaly detection in XML. In: 2016 IEEE Symposium on Security and Privacy Workshops, LangSec'16. IEEE (2016). (To appear)

Lampesberger, H., Rady, M.: Monitoring of client-cloud interaction. In: Correct Software in Web Applications and Web Services, Texts & Monographs in Symbolic Computation, pp. 177–228. Springer International Publishing (2015)

Lampesberger, H.: Technologies for web and cloud service interaction: A survey. Service Oriented Computing and Applications (2015). DOI 10.1007/s11761-015-0174-1

Lampesberger, H.: A grammatical inference approach to language-based anomaly detection in XML. In: 2013 International Conference on Availability, Reliability and Security, ECTCM'13 Workshop, pp. 685–693. IEEE (2013)

Lampesberger, H., Zeilinger, M., Hermann, E.: Statistical modeling of web requests for anomaly detection in web applications. In: Advances in IT Early Warning, pp. 91–101. Fraunhofer Verlag Stuttgart (2013)

Lampesberger, H., Winter, P., Zeilinger, M., Hermann, E.: An on-line learning statistical model to detect malicious web requests. In: Security and Privacy in Communication Networks, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 96, pp. 19–38. Springer Berlin Heidelberg (2012)

Schewe, K.D., Bósa, K., Lampesberger, H., Ma, J., Rady, M., Vleju, M.B.: Challenges in cloud computing. Scalable Computing: Practice and Experience 12(4), 385–390 (2011)

Schewe, K.D., Bósa, K., Lampesberger, H., Ma, J., Vleju, M.B.: The Christian Doppler laboratory for client-centric cloud computing. In: 2nd Workshop on Software Services, WoSS'11. Timisoara, Romania (2011)

Winter, P., Lampesberger, H., Zeilinger, M., Hermann, E.: On detecting abrupt changes in network entropy time series. In: Communications and Multimedia Security, Lecture Notes in Computer Science, vol. 7025, pp. 194–205. Springer Berlin Heidelberg (2011)

Winter, P., Lampesberger, H., Zeilinger, M., Hermann, E.: Anomalieerkennung in Computernetzen. Datenschutz und Datensicherheit - DuD 35, 235–239 (2011)

Lampesberger, H.: Empirical evaluation of the Internet Analysis System for application in the field of anomaly detection. In: European Conference on Computer Network Defense, EC2ND'10, pp. 63–70. IEEE (2010)

Lampesberger, H.: Empirical evaluation of the Internet Analysis System for application in the field of anomaly detection. Master's thesis, Upper Austria University of Applied Sciences, Hagenberg, Austria (2010)

Lampesberger, H.: Konzept und prototypische Entwicklung eines Malware-Collectors zum Einsatz im Siemens Intranet. Bachelor's thesis, Upper Austria University of Applied Sciences, Hagenberg, Austria (2006)

Lampesberger, H.: Shellcode. Bachelor's thesis, Upper Austria University of Applied Sciences, Hagenberg, Austria (2006)
