<<

SCREAM

Rob King

Introduction The Problem The Base64 Static Compilation of Regular Expressions Algorithm Regular Expressions for Analysis and Modification The Algorithm Ways of Solving the Problem Encoding Operations Performance Rob King Expression Optimization Performance Analysis DVLabs Implementation TippingPoint Technologies and Usage Common Use Cases Caveats March 25, 2010 Summary OUTLINE

SCREAM 1 Introduction Rob King The Problem

Introduction The Base64 Algorithm The Problem The Base64 Regular Expressions Algorithm Regular Expressions 2 The Algorithm The Algorithm Ways of Solving the Ways of Solving the Problem Problem Encoding Operations Encoding Operations Performance Expression 3 Performance Optimization Performance Expression Optimization Analysis

Implementation Performance Analysis and Usage Common Use Cases 4 Implementation and Usage Caveats Common Use Cases Summary Caveats 5 Summary THE PROBLEM In which we discover purpose of the whole thing...

SCREAM Rob King This talk is about inspecting streams of data for

Introduction interesting patterns, even when that stream of data has The Problem The Base64 been encoded. Algorithm Regular Expressions We focus on the Base64 encoding scheme, and The Algorithm Ways of Solving the discuss a tool that can be used when dealing with Problem Encoding Operations Base64. Performance Expression However, most portions of the algorithm are applicable Optimization Performance to other position-dependent bitwise block encodings Analysis (and, potentially, self-synchronizing encodings). Implementation and Usage We also want to talk about how Erlang made Common Use Cases Caveats development of the tool much easier. Summary This tool also provided an interesting “back door” to get Erlang accepted in DVLabs. THE PROBLEM In which we discover purpose of the whole thing...

SCREAM Rob King This talk is about inspecting streams of data for

Introduction interesting patterns, even when that stream of data has The Problem The Base64 been encoded. Algorithm Regular Expressions We focus on the Base64 encoding scheme, and The Algorithm Ways of Solving the discuss a tool that can be used when dealing with Problem Encoding Operations Base64. Performance Expression However, most portions of the algorithm are applicable Optimization Performance to other position-dependent bitwise block encodings Analysis (and, potentially, self-synchronizing encodings). Implementation and Usage We also want to talk about how Erlang made Common Use Cases Caveats development of the tool much easier. Summary This tool also provided an interesting “back door” to get Erlang accepted in DVLabs. THE PROBLEM In which we discover purpose of the whole thing...

SCREAM Rob King This talk is about inspecting streams of data for

Introduction interesting patterns, even when that stream of data has The Problem The Base64 been encoded. Algorithm Regular Expressions We focus on the Base64 encoding scheme, and The Algorithm Ways of Solving the discuss a tool that can be used when dealing with Problem Encoding Operations Base64. Performance Expression However, most portions of the algorithm are applicable Optimization Performance to other position-dependent bitwise block encodings Analysis (and, potentially, self-synchronizing encodings). Implementation and Usage We also want to talk about how Erlang made Common Use Cases Caveats development of the tool much easier. Summary This tool also provided an interesting “back door” to get Erlang accepted in DVLabs. THE PROBLEM In which we discover purpose of the whole thing...

SCREAM Rob King This talk is about inspecting streams of data for

Introduction interesting patterns, even when that stream of data has The Problem The Base64 been encoded. Algorithm Regular Expressions We focus on the Base64 encoding scheme, and The Algorithm Ways of Solving the discuss a tool that can be used when dealing with Problem Encoding Operations Base64. Performance Expression However, most portions of the algorithm are applicable Optimization Performance to other position-dependent bitwise block encodings Analysis (and, potentially, self-synchronizing encodings). Implementation and Usage We also want to talk about how Erlang made Common Use Cases Caveats development of the tool much easier. Summary This tool also provided an interesting “back door” to get Erlang accepted in DVLabs. THE PROBLEM In which we discover purpose of the whole thing...

SCREAM Rob King This talk is about inspecting streams of data for

Introduction interesting patterns, even when that stream of data has The Problem The Base64 been encoded. Algorithm Regular Expressions We focus on the Base64 encoding scheme, and The Algorithm Ways of Solving the discuss a tool that can be used when dealing with Problem Encoding Operations Base64. Performance Expression However, most portions of the algorithm are applicable Optimization Performance to other position-dependent bitwise block encodings Analysis (and, potentially, self-synchronizing encodings). Implementation and Usage We also want to talk about how Erlang made Common Use Cases Caveats development of the tool much easier. Summary This tool also provided an interesting “back door” to get Erlang accepted in DVLabs. THE PROBLEM In which we discover purpose of the whole thing...

SCREAM Rob King This talk is about inspecting streams of data for

Introduction interesting patterns, even when that stream of data has The Problem The Base64 been encoded. Algorithm Regular Expressions We focus on the Base64 encoding scheme, and The Algorithm Ways of Solving the discuss a tool that can be used when dealing with Problem Encoding Operations Base64. Performance Expression However, most portions of the algorithm are applicable Optimization Performance to other position-dependent bitwise block encodings Analysis (and, potentially, self-synchronizing encodings). Implementation and Usage We also want to talk about how Erlang made Common Use Cases Caveats development of the tool much easier. Summary This tool also provided an interesting “back door” to get Erlang accepted in DVLabs. CHALLENGES OF DATA STREAM INSPECTION

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions When looking for patterns in streams of data, several things The Algorithm must be kept in mind: Ways of Solving the Problem Encoding Operations There is no “luxury of time”. Performance Context is limited. Expression Optimization Performance Resources are limited. Analysis

Implementation and Usage Common Use Cases Caveats

Summary CHALLENGES OF DATA STREAM INSPECTION

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions When looking for patterns in streams of data, several things The Algorithm must be kept in mind: Ways of Solving the Problem Encoding Operations There is no “luxury of time”. Performance Context is limited. Expression Optimization Performance Resources are limited. Analysis

Implementation and Usage Common Use Cases Caveats

Summary CHALLENGES OF DATA STREAM INSPECTION

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions When looking for patterns in streams of data, several things The Algorithm must be kept in mind: Ways of Solving the Problem Encoding Operations There is no “luxury of time”. Performance Context is limited. Expression Optimization Performance Resources are limited. Analysis

Implementation and Usage Common Use Cases Caveats

Summary CHALLENGES OF DATA STREAM INSPECTION

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions When looking for patterns in streams of data, several things The Algorithm must be kept in mind: Ways of Solving the Problem Encoding Operations There is no “luxury of time”. Performance Context is limited. Expression Optimization Performance Resources are limited. Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODED STREAMS

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions

The Algorithm Sometimes, the streams we’re inspecting will be Ways of Solving the Problem encoded. Encoding Operations

Performance This means that we’re going to have to be (more!) Expression Optimization clever when looking for patterns in these streams. Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary DEALING WITH ENCODED STREAMS

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions

The Algorithm Ways of Solving the Problem There are several general strategies for dealing with Encoding Operations encoded streams. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary DEALING WITH ENCODED STREAMS Strategy 1: Ignore the Encoding

SCREAM

Rob King

Introduction The Problem The easiest thing to do is simply pretend the stream is The Base64 Algorithm Regular Expressions not encoded at all. The Algorithm Advantages: Ways of Solving the Problem We’re already done. Encoding Operations

Performance Disadvantages: Expression Optimization We’re essentially admitting defeat. Performance Analysis Whatever we were looking for is not going to be found. Implementation We’re stil burdening our analysis engine with lots of and Usage Common Use Cases data with which we can do nothing. Caveats

Summary DEALING WITH ENCODED STREAMS Strategy 1: Ignore the Encoding

SCREAM

Rob King

Introduction The Problem The easiest thing to do is simply pretend the stream is The Base64 Algorithm Regular Expressions not encoded at all. The Algorithm Advantages: Ways of Solving the Problem We’re already done. Encoding Operations

Performance Disadvantages: Expression Optimization We’re essentially admitting defeat. Performance Analysis Whatever we were looking for is not going to be found. Implementation We’re stil burdening our analysis engine with lots of and Usage Common Use Cases data with which we can do nothing. Caveats

Summary DEALING WITH ENCODED STREAMS Strategy 1: Ignore the Encoding

SCREAM

Rob King

Introduction The Problem The easiest thing to do is simply pretend the stream is The Base64 Algorithm Regular Expressions not encoded at all. The Algorithm Advantages: Ways of Solving the Problem We’re already done. Encoding Operations

Performance Disadvantages: Expression Optimization We’re essentially admitting defeat. Performance Analysis Whatever we were looking for is not going to be found. Implementation We’re stil burdening our analysis engine with lots of and Usage Common Use Cases data with which we can do nothing. Caveats

Summary DEALING WITH ENCODED STREAMS Strategy 2: Ignore the Data

SCREAM Rob King Rather than pretending that the data is not encoded,

Introduction we could go one step further, and detect that the data is The Problem encoded. The Base64 Algorithm Regular Expressions Once we’ve detected that the data is encoded, we can The Algorithm simply drop the stream. Ways of Solving the Problem Encoding Operations Advantages: Performance Stops burdening with inspection engine with data we Expression Optimization know we can’t inspect. Performance Analysis Disadvantages: Implementation and Usage If we “fail open” and allow encoded data to pass without Common Use Cases inspection, we just gave anyone who wants to bypass Caveats our inspection a “get out of jail free” card. Summary If we “fail closed” and block encoded data, we just blocked all legitimate uses of that encoding scheme. DEALING WITH ENCODED STREAMS Strategy 2: Ignore the Data

SCREAM Rob King Rather than pretending that the data is not encoded,

Introduction we could go one step further, and detect that the data is The Problem encoded. The Base64 Algorithm Regular Expressions Once we’ve detected that the data is encoded, we can The Algorithm simply drop the stream. Ways of Solving the Problem Encoding Operations Advantages: Performance Stops burdening with inspection engine with data we Expression Optimization know we can’t inspect. Performance Analysis Disadvantages: Implementation and Usage If we “fail open” and allow encoded data to pass without Common Use Cases inspection, we just gave anyone who wants to bypass Caveats our inspection a “get out of jail free” card. Summary If we “fail closed” and block encoded data, we just blocked all legitimate uses of that encoding scheme. DEALING WITH ENCODED STREAMS Strategy 2: Ignore the Data

SCREAM Rob King Rather than pretending that the data is not encoded,

Introduction we could go one step further, and detect that the data is The Problem encoded. The Base64 Algorithm Regular Expressions Once we’ve detected that the data is encoded, we can The Algorithm simply drop the stream. Ways of Solving the Problem Encoding Operations Advantages: Performance Stops burdening with inspection engine with data we Expression Optimization know we can’t inspect. Performance Analysis Disadvantages: Implementation and Usage If we “fail open” and allow encoded data to pass without Common Use Cases inspection, we just gave anyone who wants to bypass Caveats our inspection a “get out of jail free” card. Summary If we “fail closed” and block encoded data, we just blocked all legitimate uses of that encoding scheme. DEALING WITH ENCODED STREAMS Strategy 2: Ignore the Data

SCREAM Rob King Rather than pretending that the data is not encoded,

Introduction we could go one step further, and detect that the data is The Problem encoded. The Base64 Algorithm Regular Expressions Once we’ve detected that the data is encoded, we can The Algorithm simply drop the stream. Ways of Solving the Problem Encoding Operations Advantages: Performance Stops burdening with inspection engine with data we Expression Optimization know we can’t inspect. Performance Analysis Disadvantages: Implementation and Usage If we “fail open” and allow encoded data to pass without Common Use Cases inspection, we just gave anyone who wants to bypass Caveats our inspection a “get out of jail free” card. Summary If we “fail closed” and block encoded data, we just blocked all legitimate uses of that encoding scheme. DEALING WITH ENCODING STREAMS Strategy 3: Store and Forward

SCREAM

Rob King We could buffer the entirety of an encoded stream. Introduction The Problem Once we’ve buffered the whole stream, we can decode The Base64 Algorithm it, inspect it, reencode it, and send it on its way. Regular Expressions

The Algorithm Advantages: Ways of Solving the Problem We get the complete power of our inspection engine. Encoding Operations

Performance Disadvantages: Expression Optimization Latency becomes unbounded. Performance Analysis Resource usage becomes unbounded.

Implementation We have to modify the engine for every encoding we and Usage need to inspect. Common Use Cases Caveats These advantages and disadvantages apply roughly Summary equally to any streaming decoder. DEALING WITH ENCODING STREAMS Strategy 3: Store and Forward

SCREAM

Rob King We could buffer the entirety of an encoded stream. Introduction The Problem Once we’ve buffered the whole stream, we can decode The Base64 Algorithm it, inspect it, reencode it, and send it on its way. Regular Expressions

The Algorithm Advantages: Ways of Solving the Problem We get the complete power of our inspection engine. Encoding Operations

Performance Disadvantages: Expression Optimization Latency becomes unbounded. Performance Analysis Resource usage becomes unbounded.

Implementation We have to modify the engine for every encoding we and Usage need to inspect. Common Use Cases Caveats These advantages and disadvantages apply roughly Summary equally to any streaming decoder. DEALING WITH ENCODING STREAMS Strategy 3: Store and Forward

SCREAM

Rob King We could buffer the entirety of an encoded stream. Introduction The Problem Once we’ve buffered the whole stream, we can decode The Base64 Algorithm it, inspect it, reencode it, and send it on its way. Regular Expressions

The Algorithm Advantages: Ways of Solving the Problem We get the complete power of our inspection engine. Encoding Operations

Performance Disadvantages: Expression Optimization Latency becomes unbounded. Performance Analysis Resource usage becomes unbounded.

Implementation We have to modify the engine for every encoding we and Usage need to inspect. Common Use Cases Caveats These advantages and disadvantages apply roughly Summary equally to any streaming decoder. DEALING WITH ENCODING STREAMS Strategy 3: Store and Forward

SCREAM

Rob King We could buffer the entirety of an encoded stream. Introduction The Problem Once we’ve buffered the whole stream, we can decode The Base64 Algorithm it, inspect it, reencode it, and send it on its way. Regular Expressions

The Algorithm Advantages: Ways of Solving the Problem We get the complete power of our inspection engine. Encoding Operations

Performance Disadvantages: Expression Optimization Latency becomes unbounded. Performance Analysis Resource usage becomes unbounded.

Implementation We have to modify the engine for every encoding we and Usage need to inspect. Common Use Cases Caveats These advantages and disadvantages apply roughly Summary equally to any streaming decoder. DEALING WITH ENCODING STREAMS Strategy 3: Store and Forward

SCREAM

Rob King We could buffer the entirety of an encoded stream. Introduction The Problem Once we’ve buffered the whole stream, we can decode The Base64 Algorithm it, inspect it, reencode it, and send it on its way. Regular Expressions

The Algorithm Advantages: Ways of Solving the Problem We get the complete power of our inspection engine. Encoding Operations

Performance Disadvantages: Expression Optimization Latency becomes unbounded. Performance Analysis Resource usage becomes unbounded.

Implementation We have to modify the engine for every encoding we and Usage need to inspect. Common Use Cases Caveats These advantages and disadvantages apply roughly Summary equally to any streaming decoder. DEALING WITH ENCODED STREAMS Strategy 4: Modify the Pattern

SCREAM

Rob King

Introduction Instead of decoding the data, we could encode the The Problem The Base64 pattern. Algorithm Regular Expressions Advantages: The Algorithm Ways of Solving the We get (most?) of the power of our inspection engine. Problem Encoding Operations There is no performance penalty for decoding.

Performance Resource usage and latency are bounded. Expression Optimization We need not buffer or store context. Performance Analysis Disadvantages: Implementation A different transform must be written for each scheme. and Usage Common Use Cases Might not be possible for some schemes or patterns. Caveats False positives may become more common. Summary DEALING WITH ENCODED STREAMS Strategy 4: Modify the Pattern

SCREAM

Rob King

Introduction Instead of decoding the data, we could encode the The Problem The Base64 pattern. Algorithm Regular Expressions Advantages: The Algorithm Ways of Solving the We get (most?) of the power of our inspection engine. Problem Encoding Operations There is no performance penalty for decoding.

Performance Resource usage and latency are bounded. Expression Optimization We need not buffer or store context. Performance Analysis Disadvantages: Implementation A different transform must be written for each scheme. and Usage Common Use Cases Might not be possible for some schemes or patterns. Caveats False positives may become more common. Summary DEALING WITH ENCODED STREAMS Strategy 4: Modify the Pattern

SCREAM

Rob King

Introduction Instead of decoding the data, we could encode the The Problem The Base64 pattern. Algorithm Regular Expressions Advantages: The Algorithm Ways of Solving the We get (most?) of the power of our inspection engine. Problem Encoding Operations There is no performance penalty for decoding.

Performance Resource usage and latency are bounded. Expression Optimization We need not buffer or store context. Performance Analysis Disadvantages: Implementation A different transform must be written for each scheme. and Usage Common Use Cases Might not be possible for some schemes or patterns. Caveats False positives may become more common. Summary BASE64 In which we discover the usefulness of large radices or “Maybe the Sumerians were right after all...”

SCREAM

Rob King

Introduction The Problem The Base64 Base64 is an Internet standard for encoding arbitrary Algorithm Regular Expressions (usually binary) data using only printable characters The Algorithm common in many character sets. Ways of Solving the Problem Encoding Operations Multiple minor variants, but the most common is Performance defined in RFC4648. Expression Optimization Performance Base64 is used in numerous situations, essentially Analysis whenever binary data needs to be encoded in a Implementation and Usage printable form. Common Use Cases Caveats

Summary BASE64 Base64 Encoding and

SCREAM

Rob King The most common application of Base64 is in the encoding of attachments to email messages. Introduction The Problem The Base64 Algorithm Regular Expressions

The Algorithm Ways of Solving the Problem Encoding Operations

Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary BASE64 Base64 Input

SCREAM Rob King Base64 expects input as a series of eight-bit octets.

Introduction Every three octets are grouped together into a The Problem The Base64 collection of 24-bits. Algorithm Regular Expressions These 24-bits are then split into four six-bit sextets. The Algorithm Ways of Solving the Each sextet is used as a big-endian index into the Problem Encoding Operations zero-based array that is the Base64 alphabet. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats If the input is not evenly divisible by three, it is Summary right-padded by one or two octets of zeroes, and one or two “=” symbols are appended to the output. BASE64 Base64 Input

SCREAM Rob King Base64 expects input as a series of eight-bit octets.

Introduction Every three octets are grouped together into a The Problem The Base64 collection of 24-bits. Algorithm Regular Expressions These 24-bits are then split into four six-bit sextets. The Algorithm Ways of Solving the Each sextet is used as a big-endian index into the Problem Encoding Operations zero-based array that is the Base64 alphabet. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats If the input is not evenly divisible by three, it is Summary right-padded by one or two octets of zeroes, and one or two “=” symbols are appended to the output. BASE64 Base64 Input

SCREAM Rob King Base64 expects input as a series of eight-bit octets.

Introduction Every three octets are grouped together into a The Problem The Base64 collection of 24-bits. Algorithm Regular Expressions These 24-bits are then split into four six-bit sextets. The Algorithm Ways of Solving the Each sextet is used as a big-endian index into the Problem Encoding Operations zero-based array that is the Base64 alphabet. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats If the input is not evenly divisible by three, it is Summary right-padded by one or two octets of zeroes, and one or two “=” symbols are appended to the output. BASE64 Base64 Input

SCREAM Rob King Base64 expects input as a series of eight-bit octets.

Introduction Every three octets are grouped together into a The Problem The Base64 collection of 24-bits. Algorithm Regular Expressions These 24-bits are then split into four six-bit sextets. The Algorithm Ways of Solving the Each sextet is used as a big-endian index into the Problem Encoding Operations zero-based array that is the Base64 alphabet. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats If the input is not evenly divisible by three, it is Summary right-padded by one or two octets of zeroes, and one or two “=” symbols are appended to the output. BASE64 Base64 Input

SCREAM Rob King Base64 expects input as a series of eight-bit octets.

Introduction Every three octets are grouped together into a The Problem The Base64 collection of 24-bits. Algorithm Regular Expressions These 24-bits are then split into four six-bit sextets. The Algorithm Ways of Solving the Each sextet is used as a big-endian index into the Problem Encoding Operations zero-based array that is the Base64 alphabet. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats If the input is not evenly divisible by three, it is Summary right-padded by one or two octets of zeroes, and one or two “=” symbols are appended to the output. BASE64 Base64 Input

SCREAM Rob King Base64 expects input as a series of eight-bit octets.

Introduction Every three octets are grouped together into a The Problem The Base64 collection of 24-bits. Algorithm Regular Expressions These 24-bits are then split into four six-bit sextets. The Algorithm Ways of Solving the Each sextet is used as a big-endian index into the Problem Encoding Operations zero-based array that is the Base64 alphabet. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats If the input is not evenly divisible by three, it is Summary right-padded by one or two octets of zeroes, and one or two “=” symbols are appended to the output. BASE64 The Base64 Alphabet

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions 0 1 2 3 4 5 6 7 8 9 A B C D E F The Algorithm 0 A B C D E F G H I J K L M N O P Ways of Solving the Problem 1 Q R S T U V W X Y Z a b c d e f Encoding Operations 2 g h i j k l m n o p q r s t u v Performance 3 w x y z 0 1 2 3 4 5 6 7 8 9 + / Expression The “=” sign is used for padding. Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary BASE64 Position Dependence

SCREAM

Rob King

Introduction The Problem Base64 is position dependent. The Base64 Algorithm In general, any given string will be encoded in one of Regular Expressions

The Algorithm three different ways, depending on its offset into the Ways of Solving the Problem input. Encoding Operations

Performance This is what makes the encoding of patterns so hard. Expression Optimization Performance Encoding of “foo” at Three Different Offsets Analysis Offset 0 Zm9v Implementation and Usage Offset 1 [159BFJNRVZdhlptx]mb2[+/8-9] Common Use Cases Caveats Offset 2 [2GWm]Zvb[+/0-9w-z] Summary REGULAR EXPRESSIONS Choosing a Pattern Language

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Our choice of pattern language controls the overall Regular Expressions complexity of patterns for which we can search. The Algorithm Ways of Solving the Problem If we were just looking for static strings, we wouldn’t Encoding Operations need a new tool - just use grep. Performance Expression Optimization Regular expressions provide a good balance of ease of Performance Analysis implementation, expressive power, and common Implementation availability. and Usage Common Use Cases Caveats

Summary REGULAR EXPRESSIONS (Automata Theory) ∩ (Formal Language Theory) ≈ Regular Expressions

SCREAM

Rob King

Introduction The Problem The Base64 Formalized by Stephen Kleene in 1956. Algorithm Regular Expressions Ken Thompson incorporated them as a useful pattern The Algorithm Ways of Solving the matching tool into his version of the QED editor for Problem Encoding Operations MIT’s CTSS timesharing system. Performance This later influenced Ken Thompson’s implementation Expression Optimization Performance of ed for UNIX. Analysis

Implementation From UNIX, regular expressions spread around the and Usage Common Use Cases world. Caveats

Summary SUPPORTED REGULAR EXPRESSION OPERATIONS

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Character matches (e.g. “a”) The Algorithm Ways of Solving the Concatenation (e.g. “ab”) Problem Encoding Operations Alternation (e.g. “(ab|cd)”) Performance Expression Character classes and inverse classes (e.g. “[0 − 9]”) Optimization Performance Analysis Kleene closures (e.g. “A*C”) Implementation and Usage Common Use Cases Caveats

Summary UNSUPPORTED REGULAR EXPRESSION OPERATIONS

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Backreferences and captures The Algorithm Ways of Solving the Problem Variable-length repetition Encoding Operations

Performance Left and right anchors Expression Optimization Just about everything else Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary RESTATEMENT OF PURPOSE

SCREAM

Rob King

Introduction Come up with a way to transform an arbitrary regular The Problem The Base64 expression such that it will match its input when that Algorithm Regular Expressions input has been encoded using the Base64 algorithm. The Algorithm Ways of Solving the Do this transformation in such a way that the regular Problem Encoding Operations expression does not grow too large. Performance Expression Do this transformation in such a way that not too much Optimization Performance information is lost. Analysis

Implementation Do this transformation in such a way that the and Usage Common Use Cases expression will match regardless of the pattern’s offset Caveats from the beginning of input. Summary RESTATEMENT OF PURPOSE

SCREAM

Rob King

Introduction Come up with a way to transform an arbitrary regular The Problem The Base64 expression such that it will match its input when that Algorithm Regular Expressions input has been encoded using the Base64 algorithm. The Algorithm Ways of Solving the Do this transformation in such a way that the regular Problem Encoding Operations expression does not grow too large. Performance Expression Do this transformation in such a way that not too much Optimization Performance information is lost. Analysis

Implementation Do this transformation in such a way that the and Usage Common Use Cases expression will match regardless of the pattern’s offset Caveats from the beginning of input. Summary RESTATEMENT OF PURPOSE

SCREAM

Rob King

Introduction Come up with a way to transform an arbitrary regular The Problem The Base64 expression such that it will match its input when that Algorithm Regular Expressions input has been encoded using the Base64 algorithm. The Algorithm Ways of Solving the Do this transformation in such a way that the regular Problem Encoding Operations expression does not grow too large. Performance Expression Do this transformation in such a way that not too much Optimization Performance information is lost. Analysis

Implementation Do this transformation in such a way that the and Usage Common Use Cases expression will match regardless of the pattern’s offset Caveats from the beginning of input. Summary RESTATEMENT OF PURPOSE

SCREAM

Rob King

Introduction Come up with a way to transform an arbitrary regular The Problem The Base64 expression such that it will match its input when that Algorithm Regular Expressions input has been encoded using the Base64 algorithm. The Algorithm Ways of Solving the Do this transformation in such a way that the regular Problem Encoding Operations expression does not grow too large. Performance Expression Do this transformation in such a way that not too much Optimization Performance information is lost. Analysis

Implementation Do this transformation in such a way that the and Usage Common Use Cases expression will match regardless of the pattern’s offset Caveats from the beginning of input. Summary WAYS OF SOLVING THE PROBLEM The Right Way

SCREAM

Rob King

Introduction Convert the regular expression into a nondeterministic The Problem The Base64 Algorithm finite state automaton using Thompson’s algorithm, Regular Expressions then convert the NFA to a deterministic finite state The Algorithm Ways of Solving the automaton using the powerset construction algorithm, Problem Encoding Operations then transform the DFA into a directed acyclic graph, Performance then perform graph reductions until the graph is in a Expression Optimization minimal form, transform the minimal form using graph Performance Analysis transformations, then reduce again, then serialize the Implementation and Usage graph as a regular expression. Common Use Cases Caveats This is hard, and I’m lazy. Summary WAYS OF SOLVING THE PROBLEM The Right Way

SCREAM

Rob King

Introduction Convert the regular expression into a nondeterministic The Problem The Base64 Algorithm finite state automaton using Thompson’s algorithm, Regular Expressions then convert the NFA to a deterministic finite state The Algorithm Ways of Solving the automaton using the powerset construction algorithm, Problem Encoding Operations then transform the DFA into a directed acyclic graph, Performance then perform graph reductions until the graph is in a Expression Optimization minimal form, transform the minimal form using graph Performance Analysis transformations, then reduce again, then serialize the Implementation and Usage graph as a regular expression. Common Use Cases Caveats This is hard, and I’m lazy. Summary WAYS OF SOLVING THE PROBLEM The Right Way

SCREAM

Rob King

Introduction Convert the regular expression into a nondeterministic The Problem The Base64 Algorithm finite state automaton using Thompson’s algorithm, Regular Expressions then convert the NFA to a deterministic finite state The Algorithm Ways of Solving the automaton using the powerset construction algorithm, Problem Encoding Operations then transform the DFA into a directed acyclic graph, Performance then perform graph reductions until the graph is in a Expression Optimization minimal form, transform the minimal form using graph Performance Analysis transformations, then reduce again, then serialize the Implementation and Usage graph as a regular expression. Common Use Cases Caveats This is hard, and I’m lazy. Summary WAYS OF SOLVING THE PROBLEM The Wrong Way

SCREAM

Rob King Enumerate all possible matching strings for a regular Introduction The Problem expression. The Base64 Algorithm Regular Expressions Encode these strings and concatenate them inside an

The Algorithm alternating regular expression. Ways of Solving the Problem This would definitely work, with one problem... Encoding Operations Performance Regular expressions with Kleene closures can match Expression Optimization an infinite number of strings. Performance Analysis Enumerating an infinite number of strings can take a Implementation and Usage really long time. Common Use Cases Caveats Even the trivial regular expression “....” would match Summary over four billion strings. WAYS OF SOLVING THE PROBLEM The Wrong Way

SCREAM

Rob King Enumerate all possible matching strings for a regular Introduction The Problem expression. The Base64 Algorithm Regular Expressions Encode these strings and concatenate them inside an

The Algorithm alternating regular expression. Ways of Solving the Problem This would definitely work, with one problem... Encoding Operations Performance Regular expressions with Kleene closures can match Expression Optimization an infinite number of strings. Performance Analysis Enumerating an infinite number of strings can take a Implementation and Usage really long time. Common Use Cases Caveats Even the trivial regular expression “....” would match Summary over four billion strings. WAYS OF SOLVING THE PROBLEM The Wrong Way

SCREAM

Rob King Enumerate all possible matching strings for a regular Introduction The Problem expression. The Base64 Algorithm Regular Expressions Encode these strings and concatenate them inside an

The Algorithm alternating regular expression. Ways of Solving the Problem This would definitely work, with one problem... Encoding Operations Performance Regular expressions with Kleene closures can match Expression Optimization an infinite number of strings. Performance Analysis Enumerating an infinite number of strings can take a Implementation and Usage really long time. Common Use Cases Caveats Even the trivial regular expression “....” would match Summary over four billion strings. WAYS OF SOLVING THE PROBLEM The Wrong Way

SCREAM

Rob King Enumerate all possible matching strings for a regular Introduction The Problem expression. The Base64 Algorithm Regular Expressions Encode these strings and concatenate them inside an

The Algorithm alternating regular expression. Ways of Solving the Problem This would definitely work, with one problem... Encoding Operations Performance Regular expressions with Kleene closures can match Expression Optimization an infinite number of strings. Performance Analysis Enumerating an infinite number of strings can take a Implementation and Usage really long time. Common Use Cases Caveats Even the trivial regular expression “....” would match Summary over four billion strings. WAYS OF SOLVING THE PROBLEM The Wrong Way

SCREAM

Rob King Enumerate all possible matching strings for a regular Introduction The Problem expression. The Base64 Algorithm Regular Expressions Encode these strings and concatenate them inside an

The Algorithm alternating regular expression. Ways of Solving the Problem This would definitely work, with one problem... Encoding Operations Performance Regular expressions with Kleene closures can match Expression Optimization an infinite number of strings. Performance Analysis Enumerating an infinite number of strings can take a Implementation and Usage really long time. Common Use Cases Caveats Even the trivial regular expression “....” would match Summary over four billion strings. WAYS OF SOLVING THE PROBLEM The Wrong Way

SCREAM

Rob King Enumerate all possible matching strings for a regular Introduction The Problem expression. The Base64 Algorithm Regular Expressions Encode these strings and concatenate them inside an

The Algorithm alternating regular expression. Ways of Solving the Problem This would definitely work, with one problem... Encoding Operations Performance Regular expressions with Kleene closures can match Expression Optimization an infinite number of strings. Performance Analysis Enumerating an infinite number of strings can take a Implementation and Usage really long time. Common Use Cases Caveats Even the trivial regular expression “....” would match Summary over four billion strings. WAYS OF SOLVING THE PROBLEM The Wrong Way

SCREAM

Rob King Enumerate all possible matching strings for a regular Introduction The Problem expression. The Base64 Algorithm Regular Expressions Encode these strings and concatenate them inside an

The Algorithm alternating regular expression. Ways of Solving the Problem This would definitely work, with one problem... Encoding Operations Performance Regular expressions with Kleene closures can match Expression Optimization an infinite number of strings. Performance Analysis Enumerating an infinite number of strings can take a Implementation and Usage really long time. Common Use Cases Caveats Even the trivial regular expression “....” would match Summary over four billion strings. WAYS OF SOLVING THE PROBLEM The Wrong Way, But Faster!

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Enumerating all possible matching strings would, in

The Algorithm theory, give us the correct answer. Ways of Solving the Problem Instead of enumerating all possible strings and Encoding Operations

Performance encoding them, what if we encoded the operations? Expression Optimization In other words, what if we enumerated only what could Performance Analysis match a given operation at a given point in time? Implementation and Usage Common Use Cases Caveats

Summary WAYS OF SOLVING THE PROBLEM The Wrong Way, But Faster!

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Enumerating all possible matching strings would, in

The Algorithm theory, give us the correct answer. Ways of Solving the Problem Instead of enumerating all possible strings and Encoding Operations

Performance encoding them, what if we encoded the operations? Expression Optimization In other words, what if we enumerated only what could Performance Analysis match a given operation at a given point in time? Implementation and Usage Common Use Cases Caveats

Summary WAYS OF SOLVING THE PROBLEM The Wrong Way, But Faster!

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Enumerating all possible matching strings would, in

The Algorithm theory, give us the correct answer. Ways of Solving the Problem Instead of enumerating all possible strings and Encoding Operations

Performance encoding them, what if we encoded the operations? Expression Optimization In other words, what if we enumerated only what could Performance Analysis match a given operation at a given point in time? Implementation and Usage Common Use Cases Caveats

Summary WAYS OF SOLVING THE PROBLEM The Wrong Way, But Faster!

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Enumerating all possible matching strings would, in

The Algorithm theory, give us the correct answer. Ways of Solving the Problem Instead of enumerating all possible strings and Encoding Operations

Performance encoding them, what if we encoded the operations? Expression Optimization In other words, what if we enumerated only what could Performance Analysis match a given operation at a given point in time? Implementation and Usage Common Use Cases Caveats

Summary THE ALGORITHM In a nutshell...

SCREAM

Rob King Create a list of regular expressions that match a Introduction The Problem fixed-length string. The Base64 Algorithm Enumerate the bitstrings matched by each of these Regular Expressions

The Algorithm expressions. Ways of Solving the Problem Break each of these bitstrings into six-bit units. Encoding Operations Performance Encode each of these units. Expression Optimization Performance Treat each of these encoded strings as a branch in an Analysis n-way alternation in a regular expression. Implementation and Usage Optimize the expression. Common Use Cases Caveats There is special handling for the two “meta” operations: Summary alternations and closures. THE ALGORITHM In a nutshell...

SCREAM

Rob King Create a list of regular expressions that match a Introduction The Problem fixed-length string. The Base64 Algorithm Enumerate the bitstrings matched by each of these Regular Expressions

The Algorithm expressions. Ways of Solving the Problem Break each of these bitstrings into six-bit units. Encoding Operations Performance Encode each of these units. Expression Optimization Performance Treat each of these encoded strings as a branch in an Analysis n-way alternation in a regular expression. Implementation and Usage Optimize the expression. Common Use Cases Caveats There is special handling for the two “meta” operations: Summary alternations and closures. THE ALGORITHM In a nutshell...

SCREAM

Rob King Create a list of regular expressions that match a Introduction The Problem fixed-length string. The Base64 Algorithm Enumerate the bitstrings matched by each of these Regular Expressions

The Algorithm expressions. Ways of Solving the Problem Break each of these bitstrings into six-bit units. Encoding Operations Performance Encode each of these units. Expression Optimization Performance Treat each of these encoded strings as a branch in an Analysis n-way alternation in a regular expression. Implementation and Usage Optimize the expression. Common Use Cases Caveats There is special handling for the two “meta” operations: Summary alternations and closures. THE ALGORITHM In a nutshell...

SCREAM

Rob King Create a list of regular expressions that match a Introduction The Problem fixed-length string. The Base64 Algorithm Enumerate the bitstrings matched by each of these Regular Expressions

The Algorithm expressions. Ways of Solving the Problem Break each of these bitstrings into six-bit units. Encoding Operations Performance Encode each of these units. Expression Optimization Performance Treat each of these encoded strings as a branch in an Analysis n-way alternation in a regular expression. Implementation and Usage Optimize the expression. Common Use Cases Caveats There is special handling for the two “meta” operations: Summary alternations and closures. THE ALGORITHM In a nutshell...

SCREAM

Rob King Create a list of regular expressions that match a Introduction The Problem fixed-length string. The Base64 Algorithm Enumerate the bitstrings matched by each of these Regular Expressions

The Algorithm expressions. Ways of Solving the Problem Break each of these bitstrings into six-bit units. Encoding Operations Performance Encode each of these units. Expression Optimization Performance Treat each of these encoded strings as a branch in an Analysis n-way alternation in a regular expression. Implementation and Usage Optimize the expression. Common Use Cases Caveats There is special handling for the two “meta” operations: Summary alternations and closures. THE ALGORITHM In a nutshell...

SCREAM

Rob King Create a list of regular expressions that match a Introduction The Problem fixed-length string. The Base64 Algorithm Enumerate the bitstrings matched by each of these Regular Expressions

The Algorithm expressions. Ways of Solving the Problem Break each of these bitstrings into six-bit units. Encoding Operations Performance Encode each of these units. Expression Optimization Performance Treat each of these encoded strings as a branch in an Analysis n-way alternation in a regular expression. Implementation and Usage Optimize the expression. Common Use Cases Caveats There is special handling for the two “meta” operations: Summary alternations and closures. THE ALGORITHM In a nutshell...

SCREAM

Rob King Create a list of regular expressions that match a Introduction The Problem fixed-length string. The Base64 Algorithm Enumerate the bitstrings matched by each of these Regular Expressions

The Algorithm expressions. Ways of Solving the Problem Break each of these bitstrings into six-bit units. Encoding Operations Performance Encode each of these units. Expression Optimization Performance Treat each of these encoded strings as a branch in an Analysis n-way alternation in a regular expression. Implementation and Usage Optimize the expression. Common Use Cases Caveats There is special handling for the two “meta” operations: Summary alternations and closures. THE ALGORITHM In a nutshell...

SCREAM

Rob King Create a list of regular expressions that match a Introduction The Problem fixed-length string. The Base64 Algorithm Enumerate the bitstrings matched by each of these Regular Expressions

The Algorithm expressions. Ways of Solving the Problem Break each of these bitstrings into six-bit units. Encoding Operations Performance Encode each of these units. Expression Optimization Performance Treat each of these encoded strings as a branch in an Analysis n-way alternation in a regular expression. Implementation and Usage Optimize the expression. Common Use Cases Caveats There is special handling for the two “meta” operations: Summary alternations and closures. ENCODING OPERATIONS The Fixed-Length Rule

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Non-fixed-length operations leave ambiguity as to what

The Algorithm bits to place in a given six-bit unit. Ways of Solving the Problem The goal of this phase of the algorithm is to get a Encoding Operations

Performance collection of regular expressions that match fixed-length Expression Optimization strings. Performance Analysis First, let’s go over how to encode the atomic operations. Implementation and Usage Common Use Cases Caveats

Summary ENCODING OPERATIONS Single Characters

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions This is a fairly obvious - a single character is encoded as the

The Algorithm bitstring for that character. Ways of Solving the Problem Encoding Operations

Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING OPERATIONS Character Classes and Inverted Character Classes

SCREAM

Rob King Character classes are stored as a list of all characters included in the class. This is not the most efficient possible Introduction The Problem choice in terms of memory usage, but its implementation is The Base64 Algorithm simpler and it is often faster. Regular Expressions

The Algorithm Ways of Solving the Problem Encoding Operations

Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary Inverted character classes simply store the complement of the list of indicated characters. ENCODING OPERATIONS Wildcards

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions

The Algorithm Wildcards are simply encoded as a character class with 256 Ways of Solving the Problem entries. In the underlying implementation they are encoded Encoding Operations as a special to save space, but from the point of view Performance Expression of the algorithm there is no difference. Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING OPERATIONS Concatenation

SCREAM

Rob King

Introduction Single characters, character classes, and wildcards are The Problem The Base64 Algorithm treated as the building blocks of lists. Regular Expressions Each fixed-length expression is simply a list of these The Algorithm Ways of Solving the operations. Problem Encoding Operations Therefore, concatenation is modelled implicitly by the Performance Expression ordering of these operations in the list. Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING ALTERNATIONS Alternations and Fixed Lengths

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions It was stated earlier that the algorithm deals only with The Algorithm Ways of Solving the regular expressions that match only fixed-length strings. Problem Encoding Operations Alternations can easily violate this: if one branch of an Performance Expression alternation has more character matches than the other, Optimization Performance the expression overall is not fixed length. Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING ALTERNATIONS Alternations and Fixed Lengths

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions It was stated earlier that the algorithm deals only with The Algorithm Ways of Solving the regular expressions that match only fixed-length strings. Problem Encoding Operations Alternations can easily violate this: if one branch of an Performance Expression alternation has more character matches than the other, Optimization Performance the expression overall is not fixed length. Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING ALTERNATIONS Alternations and Fixed Lengths

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions It was stated earlier that the algorithm deals only with The Algorithm Ways of Solving the regular expressions that match only fixed-length strings. Problem Encoding Operations Alternations can easily violate this: if one branch of an Performance Expression alternation has more character matches than the other, Optimization Performance the expression overall is not fixed length. Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING ALTERNATIONS Alternations as Directed Acyclic Graphs

SCREAM

Rob King

Introduction Alternations turn our simple lists of operations into The Problem The Base64 directed acyclic graphs. Algorithm Regular Expressions For example, below is the representation of the regular The Algorithm Ways of Solving the expression “A(B|(C|D)E)F”. Problem Encoding Operations

Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING ALTERNATIONS Making Alternations Fixed-Length

SCREAM

Rob King We can easily create a list of fixed-length expressions by enumerating all topological sorts of the graph. Introduction The Problem For each distinct topological sort, we can create a list of The Base64 Algorithm fixed-length operations. Regular Expressions The Algorithm This reduces an alternation to a fixed-length structure, Ways of Solving the Problem and can be performed recursively, resulting in a list of Encoding Operations

Performance expressions, each of a fixed-length. Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING ALTERNATIONS The Mythical End

SCREAM Rob King By adding an explicit “end of expression”

Introduction pseudo-operation to the end of every expression, we The Problem can guarantee that all expressions have a shared tail. The Base64 Algorithm Regular Expressions We can also add some number of explicit “start of The Algorithm expression” pseudo-operations from every entry point Ways of Solving the Problem Encoding Operations into the expression. Performance These operations does not actually match anything; Expression Optimization they exist solely for ease of implementation. Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING ALTERNATIONS Encoding Each Fixed-Length Expression

SCREAM

Rob King

Introduction By finding a path from every START node to the END The Problem node, we can produce a list of fixed-length expressions. The Base64 Algorithm Regular Expressions Note that this is essentially equivalent of the list of The Algorithm topological sorts, but provides the storage advantage of Ways of Solving the Problem Encoding Operations a shared tail. Performance Implementation is also considerably easier and Expression Optimization performs much better, since we don’t need to Performance Analysis implement a general topological sort; rather we can Implementation and Usage simply maintain a list of all start nodes and follow the Common Use Cases Caveats only eminating edge from every node until we reach the Summary end. ENCODING ALTERNATIONS Recursive Application

SCREAM Rob King We can apply this recursively, to build up a list of Introduction fixed-length expressions from a regular expression, The Problem The Base64 even with nested alternations. Algorithm Regular Expressions Each of these fixed length expressions can be encoded The Algorithm Ways of Solving the easily using the methods for the fixed-length operations Problem Encoding Operations above. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITHOUT CLOSURES Create a List of Fixed Length Bit Expressions

SCREAM

Rob King

Introduction The Problem The Base64 Using the algorithm above, we can create a list of Algorithm Regular Expressions fixed-length expressions consisting of eight bit units (for The Algorithm single characters) or lists of eight bit units (for character Ways of Solving the Problem Encoding Operations classes and wildcards).

Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITHOUT CLOSURES Extract Six Bits at a Time

SCREAM

Rob King

Introduction The Problem We now use a consumer function to walk through each The Base64 Algorithm fixed-length expression, extracting six bits at a time. Regular Expressions The Algorithm This produces a new list of six-bit units. Ways of Solving the Problem Encoding Operations When the consumer function must get bits from a

Performance single character and a character class, or two classes, Expression Optimization a list of six bit units is placed in the result list. Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITHOUT CLOSURES The Initial and Final Bits

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Since the expression is going to be examining data in a Regular Expressions streaming context, left and right anchors don’t make The Algorithm Ways of Solving the sense in most situations. Problem Encoding Operations Therefore, if any bits are needed to complete a six-bit Performance Expression unit at the beginning or the end of the expression, they Optimization Performance Analysis are treated as if a wildcard were present immediately

Implementation before or after the expression. and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITHOUT CLOSURES The Initial and Final Bits

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Since the expression is going to be examining data in a Regular Expressions streaming context, left and right anchors don’t make The Algorithm Ways of Solving the sense in most situations. Problem Encoding Operations Therefore, if any bits are needed to complete a six-bit Performance Expression unit at the beginning or the end of the expression, they Optimization Performance Analysis are treated as if a wildcard were present immediately

Implementation before or after the expression. and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITHOUT CLOSURES The Initial and Final Bits

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Since the expression is going to be examining data in a Regular Expressions streaming context, left and right anchors don’t make The Algorithm Ways of Solving the sense in most situations. Problem Encoding Operations Therefore, if any bits are needed to complete a six-bit Performance Expression unit at the beginning or the end of the expression, they Optimization Performance Analysis are treated as if a wildcard were present immediately

Implementation before or after the expression. and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITHOUT CLOSURES Encode the Expressions

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Each six bit unit is then replaced with its equivalent Regular Expressions

The Algorithm symbol in the Base64 alphabet. Ways of Solving the Problem For lists of six bit units (from character classes), a list of Encoding Operations Base64 symbols is produced. Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITHOUT CLOSURES The Final Expression

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm The list of fixed-length encoded expressions is joined Regular Expressions into an n-way alternation, which is treated as a regular The Algorithm Ways of Solving the expression. Problem Encoding Operations This expression is first run through the optimizer, which Performance Expression recursively refactors common prefixes and suffixes. Optimization Performance Analysis The expression is then reprocessed two more times, to Implementation account for varying offsets from the beginning of input. and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITH CLOSURES Violating the Fixed Length and Acyclic Rules

SCREAM

Rob King Expressions with Kleene closures violate the fixed-length rule, since subexpressions can appear Introduction The Problem from zero to infinitely many times. The Base64 Algorithm Regular Expressions Expressions with Kleene closures violate the acylic

The Algorithm rule, since repeating expressions would become cycles Ways of Solving the Problem in the graph representation. Encoding Operations

Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITH CLOSURES The Easy Case

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions The easiest case for encoding closures is the case The Algorithm where they don’t appear at all. Ways of Solving the Problem Encoding Operations Whenever a closure is present in an expression, it acts

Performance as a virtual alternation, with one branch having at least Expression Optimization one instance of the closure, and the other branch Performance Analysis having no instances. Implementation and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITH CLOSURES Marking Closures

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Closures are encoded three times, to account for The Algorithm Ways of Solving the varying offsets from the beginning of input. Problem Encoding Operations These three encoded expressions are placed in a Performance Expression special data structure that marks them as the Optimization Performance components of a closure. Analysis

Implementation and Usage Common Use Cases Caveats

Summary ENCODING EXPRESSIONS WITH CLOSURES A List of Expressions

SCREAM

Rob King

Introduction The Problem The Base64 Expressions with closures are treated as a list of Algorithm Regular Expressions expressions. The Algorithm Ways of Solving the The expressions are split such that each Problem Encoding Operations specially-encoded closure is a separate expression. Performance Lists of expressions are encoded such that the implicit Expression Optimization Performance wildcard at the beginning and ending of each Analysis expression is replaced with the appropriate values from Implementation and Usage the next or previous expression in the list. Common Use Cases Caveats

Summary EXPRESSION OPTIMIZATION Common Prefix and Suffix Refactoring

SCREAM

Rob King

Introduction The expression is treated as a list of constant-length The Problem The Base64 expressions, from each possible starting expression to Algorithm Regular Expressions the logical end. The Algorithm Ways of Solving the All of these expressions will differ only around places Problem Encoding Operations where alternations were present. Performance Expression Therefore, everything leading up to the alternation, and Optimization Performance everything after the alternation, will be identical for Analysis

Implementation various expressions. and Usage Common Use Cases We can therefore optimize the expression by Caveats refactoring all common prefixes and suffixes. Summary EXPRESSION OPTIMIZATION Common Prefix and Suffix Refactoring

SCREAM

Rob King

Introduction The expression is treated as a list of constant-length The Problem The Base64 expressions, from each possible starting expression to Algorithm Regular Expressions the logical end. The Algorithm Ways of Solving the All of these expressions will differ only around places Problem Encoding Operations where alternations were present. Performance Expression Therefore, everything leading up to the alternation, and Optimization Performance everything after the alternation, will be identical for Analysis

Implementation various expressions. and Usage Common Use Cases We can therefore optimize the expression by Caveats refactoring all common prefixes and suffixes. Summary EXPRESSION OPTIMIZATION Common Prefix and Suffix Refactoring

SCREAM

Rob King

Introduction The expression is treated as a list of constant-length The Problem The Base64 expressions, from each possible starting expression to Algorithm Regular Expressions the logical end. The Algorithm Ways of Solving the All of these expressions will differ only around places Problem Encoding Operations where alternations were present. Performance Expression Therefore, everything leading up to the alternation, and Optimization Performance everything after the alternation, will be identical for Analysis

Implementation various expressions. and Usage Common Use Cases We can therefore optimize the expression by Caveats refactoring all common prefixes and suffixes. Summary EXPRESSION OPTIMIZATION Common Prefix and Suffix Refactoring

SCREAM

Rob King

Introduction The expression is treated as a list of constant-length The Problem The Base64 expressions, from each possible starting expression to Algorithm Regular Expressions the logical end. The Algorithm Ways of Solving the All of these expressions will differ only around places Problem Encoding Operations where alternations were present. Performance Expression Therefore, everything leading up to the alternation, and Optimization Performance everything after the alternation, will be identical for Analysis

Implementation various expressions. and Usage Common Use Cases We can therefore optimize the expression by Caveats refactoring all common prefixes and suffixes. Summary EXPRESSION OPTIMIZATION Common Prefix and Suffix Refactoring

SCREAM

Rob King

Introduction The expression is treated as a list of constant-length The Problem The Base64 expressions, from each possible starting expression to Algorithm Regular Expressions the logical end. The Algorithm Ways of Solving the All of these expressions will differ only around places Problem Encoding Operations where alternations were present. Performance Expression Therefore, everything leading up to the alternation, and Optimization Performance everything after the alternation, will be identical for Analysis

Implementation various expressions. and Usage Common Use Cases We can therefore optimize the expression by Caveats refactoring all common prefixes and suffixes. Summary PERFORMANCE ANALYSIS Formal Analysis of the Naive Algorithm

SCREAM

Rob King

2 Introduction In general, the algorithm will run in O(n ) time, since The Problem The Base64 every addition of an alternation results in twice as many Algorithm Regular Expressions START nodes. The Algorithm Ways of Solving the This can of course become rapidly intractable, since Problem Encoding Operations even having 32 alternations would result in over four Performance billion START nodes. Expression Optimization Performance While we do perform some optimizations to improve the Analysis

Implementation theoretical running time, it is useful to point out that in and Usage Common Use Cases the vast majority of practical runs, expressions Caveats generally have fewer than twenty alternations. Summary PERFORMANCE ANALYSIS Formal Analysis of the Naive Algorithm

SCREAM

Rob King

2 Introduction In general, the algorithm will run in O(n ) time, since The Problem The Base64 every addition of an alternation results in twice as many Algorithm Regular Expressions START nodes. The Algorithm Ways of Solving the This can of course become rapidly intractable, since Problem Encoding Operations even having 32 alternations would result in over four Performance billion START nodes. Expression Optimization Performance While we do perform some optimizations to improve the Analysis

Implementation theoretical running time, it is useful to point out that in and Usage Common Use Cases the vast majority of practical runs, expressions Caveats generally have fewer than twenty alternations. Summary PERFORMANCE ANALYSIS Formal Analysis of the Naive Algorithm

SCREAM

Rob King

2 Introduction In general, the algorithm will run in O(n ) time, since The Problem The Base64 every addition of an alternation results in twice as many Algorithm Regular Expressions START nodes. The Algorithm Ways of Solving the This can of course become rapidly intractable, since Problem Encoding Operations even having 32 alternations would result in over four Performance billion START nodes. Expression Optimization Performance While we do perform some optimizations to improve the Analysis

Implementation theoretical running time, it is useful to point out that in and Usage Common Use Cases the vast majority of practical runs, expressions Caveats generally have fewer than twenty alternations. Summary PERFORMANCE ANALYSIS Formal Analysis of the Naive Algorithm

SCREAM

Rob King

2 Introduction In general, the algorithm will run in O(n ) time, since The Problem The Base64 every addition of an alternation results in twice as many Algorithm Regular Expressions START nodes. The Algorithm Ways of Solving the This can of course become rapidly intractable, since Problem Encoding Operations even having 32 alternations would result in over four Performance billion START nodes. Expression Optimization Performance While we do perform some optimizations to improve the Analysis

Implementation theoretical running time, it is useful to point out that in and Usage Common Use Cases the vast majority of practical runs, expressions Caveats generally have fewer than twenty alternations. Summary PERFORMANCE ANALYSIS Optimizing the Algorithm

SCREAM The primary space optimization is the shared tail of the Rob King core data structure. Outside of pathological cases, this Introduction saves considerable storage. The Problem The Base64 Algorithm The alternation processing portion of the algorithm in Regular Expressions reality only splits the expression when the two branches The Algorithm Ways of Solving the of the expression are of unequal length. This reduces Problem Encoding Operations the number of paths in many situations. Performance Expression Most of the calculations are memoized; that is, they are Optimization Performance run only once for any given input for any given position Analysis and offset. Implementation and Usage With all of these optimizations taken into account, for Common Use Cases Caveats most regular expressions in our test data set, runtimes Summary of under three minutes are found (though memory usage is roughly quadratic on the number of operations in the expression). PERFORMANCE ANALYSIS Optimizing the Algorithm

SCREAM The primary space optimization is the shared tail of the Rob King core data structure. Outside of pathological cases, this Introduction saves considerable storage. The Problem The Base64 Algorithm The alternation processing portion of the algorithm in Regular Expressions reality only splits the expression when the two branches The Algorithm Ways of Solving the of the expression are of unequal length. This reduces Problem Encoding Operations the number of paths in many situations. Performance Expression Most of the calculations are memoized; that is, they are Optimization Performance run only once for any given input for any given position Analysis and offset. Implementation and Usage With all of these optimizations taken into account, for Common Use Cases Caveats most regular expressions in our test data set, runtimes Summary of under three minutes are found (though memory usage is roughly quadratic on the number of operations in the expression). PERFORMANCE ANALYSIS Optimizing the Algorithm

SCREAM The primary space optimization is the shared tail of the Rob King core data structure. Outside of pathological cases, this Introduction saves considerable storage. The Problem The Base64 Algorithm The alternation processing portion of the algorithm in Regular Expressions reality only splits the expression when the two branches The Algorithm Ways of Solving the of the expression are of unequal length. This reduces Problem Encoding Operations the number of paths in many situations. Performance Expression Most of the calculations are memoized; that is, they are Optimization Performance run only once for any given input for any given position Analysis and offset. Implementation and Usage With all of these optimizations taken into account, for Common Use Cases Caveats most regular expressions in our test data set, runtimes Summary of under three minutes are found (though memory usage is roughly quadratic on the number of operations in the expression). PERFORMANCE ANALYSIS Optimizing the Algorithm

SCREAM The primary space optimization is the shared tail of the Rob King core data structure. Outside of pathological cases, this Introduction saves considerable storage. The Problem The Base64 Algorithm The alternation processing portion of the algorithm in Regular Expressions reality only splits the expression when the two branches The Algorithm Ways of Solving the of the expression are of unequal length. This reduces Problem Encoding Operations the number of paths in many situations. Performance Expression Most of the calculations are memoized; that is, they are Optimization Performance run only once for any given input for any given position Analysis and offset. Implementation and Usage With all of these optimizations taken into account, for Common Use Cases Caveats most regular expressions in our test data set, runtimes Summary of under three minutes are found (though memory usage is roughly quadratic on the number of operations in the expression). PERFORMANCE ANALYSIS Optimizing the Algorithm

SCREAM The primary space optimization is the shared tail of the Rob King core data structure. Outside of pathological cases, this Introduction saves considerable storage. The Problem The Base64 Algorithm The alternation processing portion of the algorithm in Regular Expressions reality only splits the expression when the two branches The Algorithm Ways of Solving the of the expression are of unequal length. This reduces Problem Encoding Operations the number of paths in many situations. Performance Expression Most of the calculations are memoized; that is, they are Optimization Performance run only once for any given input for any given position Analysis and offset. Implementation and Usage With all of these optimizations taken into account, for Common Use Cases Caveats most regular expressions in our test data set, runtimes Summary of under three minutes are found (though memory usage is roughly quadratic on the number of operations in the expression). THE IMPLEMENTATION History - Why Erlang?

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm The initial version of the tool was written in Lua. Regular Expressions

The Algorithm However, the tool was quickly rewritten in Erlang. Ways of Solving the Problem Erlang provided a much cleaner method of Encoding Operations implementing shared-tail lists, memoization, and bit Performance Expression manipulation. Optimization Performance Total implementation is around 1500 lines of heavily- Analysis

Implementation commented Erlang. and Usage Common Use Cases Caveats

Summary THE IMPLEMENTATION History - Why Erlang?

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm The initial version of the tool was written in Lua. Regular Expressions

The Algorithm However, the tool was quickly rewritten in Erlang. Ways of Solving the Problem Erlang provided a much cleaner method of Encoding Operations implementing shared-tail lists, memoization, and bit Performance Expression manipulation. Optimization Performance Total implementation is around 1500 lines of heavily- Analysis

Implementation commented Erlang. and Usage Common Use Cases Caveats

Summary THE IMPLEMENTATION History - Why Erlang?

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm The initial version of the tool was written in Lua. Regular Expressions

The Algorithm However, the tool was quickly rewritten in Erlang. Ways of Solving the Problem Erlang provided a much cleaner method of Encoding Operations implementing shared-tail lists, memoization, and bit Performance Expression manipulation. Optimization Performance Total implementation is around 1500 lines of heavily- Analysis

Implementation commented Erlang. and Usage Common Use Cases Caveats

Summary THE IMPLEMENTATION History - Why Erlang?

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm The initial version of the tool was written in Lua. Regular Expressions

The Algorithm However, the tool was quickly rewritten in Erlang. Ways of Solving the Problem Erlang provided a much cleaner method of Encoding Operations implementing shared-tail lists, memoization, and bit Performance Expression manipulation. Optimization Performance Total implementation is around 1500 lines of heavily- Analysis

Implementation commented Erlang. and Usage Common Use Cases Caveats

Summary THE IMPLEMENTATION How Erlang Helped

SCREAM

Rob King The tool’s functionality can be divided into the following Introduction stages, each of which is well-served by some part of The Problem The Base64 Erlang or the OTP: Algorithm Regular Expressions Parse the regular expression (YECC). The Algorithm Create a series of bitstrings and arrays of bitstrings Ways of Solving the Problem (Erlang’s native bitstring functionality). Encoding Operations Create a graph of bitstring nodes, and manipulate the Performance Expression graph (Erlang’s DIGRAPH module). Optimization Performance Determine unique paths through the graph (DIGRAPH). Analysis Compute the transformations on each path, in parallel Implementation and Usage (Erlang’s extremely easy process creation). Common Use Cases Combine the results and optimize them (optimization Caveats

Summary problem was very easy to express in a functional way). Serialize everything out (uhm...IO:FORMAT?). THE IMPLEMENTATION The Future

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Erlang makes it very easy to parallelize computation The Algorithm Ways of Solving the across multiple threads, processors, and machines. Problem Encoding Operations The algorithm itself is very amenable to parallelization, Performance Expression and an experimental version of the tool has shown Optimization Performance significant speedups. Analysis

Implementation and Usage Common Use Cases Caveats

Summary THE IMPLEMENTATION The Future

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions Erlang makes it very easy to parallelize computation The Algorithm Ways of Solving the across multiple threads, processors, and machines. Problem Encoding Operations The algorithm itself is very amenable to parallelization, Performance Expression and an experimental version of the tool has shown Optimization Performance significant speedups. Analysis

Implementation and Usage Common Use Cases Caveats

Summary COMMON USE CASES Prefiltering Email

SCREAM

Rob King While a traditional store-and-forward email server is of Introduction The Problem course optimal for filtering email, b64re-based The Base64 Algorithm signatures on an IPS have been useful in prefiltering Regular Expressions email. The Algorithm Ways of Solving the Problem One common use case is filtering obviously malicious Encoding Operations email, based on shellcode detection or other Performance Expression parameters, lowering the load on the backend mail Optimization Performance server. Analysis Implementation Another interesting application is protecting and Usage Common Use Cases store-and-forward scanners from themselves. Some Caveats systems have bugs that can be triggered by specially Summary formatted ; an inline IPS can filter these attacks. COMMON USE CASES Prefiltering Email

SCREAM

Rob King While a traditional store-and-forward email server is of Introduction The Problem course optimal for filtering email, b64re-based The Base64 Algorithm signatures on an IPS have been useful in prefiltering Regular Expressions email. The Algorithm Ways of Solving the Problem One common use case is filtering obviously malicious Encoding Operations email, based on shellcode detection or other Performance Expression parameters, lowering the load on the backend mail Optimization Performance server. Analysis Implementation Another interesting application is protecting and Usage Common Use Cases store-and-forward scanners from themselves. Some Caveats systems have bugs that can be triggered by specially Summary formatted emails; an inline IPS can filter these attacks. COMMON USE CASES Prefiltering Email

SCREAM

Rob King While a traditional store-and-forward email server is of Introduction The Problem course optimal for filtering email, b64re-based The Base64 Algorithm signatures on an IPS have been useful in prefiltering Regular Expressions email. The Algorithm Ways of Solving the Problem One common use case is filtering obviously malicious Encoding Operations email, based on shellcode detection or other Performance Expression parameters, lowering the load on the backend mail Optimization Performance server. Analysis Implementation Another interesting application is protecting and Usage Common Use Cases store-and-forward scanners from themselves. Some Caveats systems have bugs that can be triggered by specially Summary formatted emails; an inline IPS can filter these attacks. COMMON USE CASES Prefiltering Email

SCREAM

Rob King While a traditional store-and-forward email server is of Introduction The Problem course optimal for filtering email, b64re-based The Base64 Algorithm signatures on an IPS have been useful in prefiltering Regular Expressions email. The Algorithm Ways of Solving the Problem One common use case is filtering obviously malicious Encoding Operations email, based on shellcode detection or other Performance Expression parameters, lowering the load on the backend mail Optimization Performance server. Analysis Implementation Another interesting application is protecting and Usage Common Use Cases store-and-forward scanners from themselves. Some Caveats systems have bugs that can be triggered by specially Summary formatted emails; an inline IPS can filter these attacks. COMMON USE CASES Data Extrusion

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions

The Algorithm Email is a common source of sensitive data leakage. Ways of Solving the Problem An inline IPS running signatures using this tool can be Encoding Operations

Performance used to scan for outbound sensitive information, even if Expression Optimization an external mail server is used. Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary COMMON USE CASES Data Extrusion

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions

The Algorithm Email is a common source of sensitive data leakage. Ways of Solving the Problem An inline IPS running signatures using this tool can be Encoding Operations

Performance used to scan for outbound sensitive information, even if Expression Optimization an external mail server is used. Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary COMMON USE CASES Data Extrusion

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions

The Algorithm Email is a common source of sensitive data leakage. Ways of Solving the Problem An inline IPS running signatures using this tool can be Encoding Operations

Performance used to scan for outbound sensitive information, even if Expression Optimization an external mail server is used. Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary FALSE POSITIVES Detecting the Wrong Thing

SCREAM

Rob King

Introduction Expressions processed in this way are inherently looser The Problem The Base64 than their unencoded counterparts. Why? Algorithm Regular Expressions Recall that the expression is encoded three times, to The Algorithm Ways of Solving the account for varying offsets from the beginning of input. Problem Encoding Operations The expression designed to match when its offset is Performance Expression n + 1 from the beginning of the input could match input Optimization Performance that is actually at a different modular offset. Analysis

Implementation This is mitigated somewhat if the expression can be and Usage Common Use Cases anchored by long constant strings, or if the beginning of Caveats input is known. Summary FALSE POSITIVES Detecting the Wrong Thing

SCREAM

Rob King

Introduction Expressions processed in this way are inherently looser The Problem The Base64 than their unencoded counterparts. Why? Algorithm Regular Expressions Recall that the expression is encoded three times, to The Algorithm Ways of Solving the account for varying offsets from the beginning of input. Problem Encoding Operations The expression designed to match when its offset is Performance Expression n + 1 from the beginning of the input could match input Optimization Performance that is actually at a different modular offset. Analysis

Implementation This is mitigated somewhat if the expression can be and Usage Common Use Cases anchored by long constant strings, or if the beginning of Caveats input is known. Summary FALSE POSITIVES Detecting the Wrong Thing

SCREAM

Rob King

Introduction Expressions processed in this way are inherently looser The Problem The Base64 than their unencoded counterparts. Why? Algorithm Regular Expressions Recall that the expression is encoded three times, to The Algorithm Ways of Solving the account for varying offsets from the beginning of input. Problem Encoding Operations The expression designed to match when its offset is Performance Expression n + 1 from the beginning of the input could match input Optimization Performance that is actually at a different modular offset. Analysis

Implementation This is mitigated somewhat if the expression can be and Usage Common Use Cases anchored by long constant strings, or if the beginning of Caveats input is known. Summary FALSE POSITIVES Detecting the Wrong Thing

SCREAM

Rob King

Introduction Expressions processed in this way are inherently looser The Problem The Base64 than their unencoded counterparts. Why? Algorithm Regular Expressions Recall that the expression is encoded three times, to The Algorithm Ways of Solving the account for varying offsets from the beginning of input. Problem Encoding Operations The expression designed to match when its offset is Performance Expression n + 1 from the beginning of the input could match input Optimization Performance that is actually at a different modular offset. Analysis

Implementation This is mitigated somewhat if the expression can be and Usage Common Use Cases anchored by long constant strings, or if the beginning of Caveats input is known. Summary FALSE POSITIVES Detecting the Wrong Thing

SCREAM

Rob King

Introduction Expressions processed in this way are inherently looser The Problem The Base64 than their unencoded counterparts. Why? Algorithm Regular Expressions Recall that the expression is encoded three times, to The Algorithm Ways of Solving the account for varying offsets from the beginning of input. Problem Encoding Operations The expression designed to match when its offset is Performance Expression n + 1 from the beginning of the input could match input Optimization Performance that is actually at a different modular offset. Analysis

Implementation This is mitigated somewhat if the expression can be and Usage Common Use Cases anchored by long constant strings, or if the beginning of Caveats input is known. Summary INSERTION Breaking Up Input

SCREAM

Rob King

Introduction The Problem The Base64 Most Base64 processors allow the insertion of Algorithm Regular Expressions at arbitrary points of the input. The Algorithm Ways of Solving the The encoded expression can be modified to include an Problem Encoding Operations optional newline after every character, though this is Performance expensive. Expression Optimization Performance Some regular expression engines offer the ability to Analysis strip newlines from input, removing the burden on the Implementation and Usage expression. Common Use Cases Caveats

Summary NEWLINE INSERTION Breaking Up Input

SCREAM

Rob King

Introduction The Problem The Base64 Most Base64 processors allow the insertion of newlines Algorithm Regular Expressions at arbitrary points of the input. The Algorithm Ways of Solving the The encoded expression can be modified to include an Problem Encoding Operations optional newline after every character, though this is Performance expensive. Expression Optimization Performance Some regular expression engines offer the ability to Analysis strip newlines from input, removing the burden on the Implementation and Usage expression. Common Use Cases Caveats

Summary NEWLINE INSERTION Breaking Up Input

SCREAM

Rob King

Introduction The Problem The Base64 Most Base64 processors allow the insertion of newlines Algorithm Regular Expressions at arbitrary points of the input. The Algorithm Ways of Solving the The encoded expression can be modified to include an Problem Encoding Operations optional newline after every character, though this is Performance expensive. Expression Optimization Performance Some regular expression engines offer the ability to Analysis strip newlines from input, removing the burden on the Implementation and Usage expression. Common Use Cases Caveats

Summary NEWLINE INSERTION Breaking Up Input

SCREAM

Rob King

Introduction The Problem The Base64 Most Base64 processors allow the insertion of newlines Algorithm Regular Expressions at arbitrary points of the input. The Algorithm Ways of Solving the The encoded expression can be modified to include an Problem Encoding Operations optional newline after every character, though this is Performance expensive. Expression Optimization Performance Some regular expression engines offer the ability to Analysis strip newlines from input, removing the burden on the Implementation and Usage expression. Common Use Cases Caveats

Summary SUMMARY

SCREAM Using this tool, users can reap several benefits: Rob King Pre-filter encoded data streams, lessening the need for

Introduction expensive decoding. The Problem Protect infrastructure that provides decoding services The Base64 Algorithm but is itself vulnerable Regular Expressions Inspect encoded streams within incurring the overhead The Algorithm Ways of Solving the of decoding. Problem Encoding Operations This talk illustrates some interesting challenges when Performance inspecting encoded streams of data, especially in a Expression Optimization Performance performance-critical way, and gives some possible Analysis solutions. Implementation and Usage This talk illustrates an interesting application of Erlang Common Use Cases Caveats outside of its normal domain of highly parallel soft

Summary real-time applications, and shows how its strengths make it well-suited for this sort of problem. This tool served as a “back door” for getting Erlang into DVLabs. SUMMARY

SCREAM Using this tool, users can reap several benefits: Rob King Pre-filter encoded data streams, lessening the need for

Introduction expensive decoding. The Problem Protect infrastructure that provides decoding services The Base64 Algorithm but is itself vulnerable Regular Expressions Inspect encoded streams within incurring the overhead The Algorithm Ways of Solving the of decoding. Problem Encoding Operations This talk illustrates some interesting challenges when Performance inspecting encoded streams of data, especially in a Expression Optimization Performance performance-critical way, and gives some possible Analysis solutions. Implementation and Usage This talk illustrates an interesting application of Erlang Common Use Cases Caveats outside of its normal domain of highly parallel soft

Summary real-time applications, and shows how its strengths make it well-suited for this sort of problem. This tool served as a “back door” for getting Erlang into DVLabs. SUMMARY

SCREAM Using this tool, users can reap several benefits: Rob King Pre-filter encoded data streams, lessening the need for

Introduction expensive decoding. The Problem Protect infrastructure that provides decoding services The Base64 Algorithm but is itself vulnerable Regular Expressions Inspect encoded streams within incurring the overhead The Algorithm Ways of Solving the of decoding. Problem Encoding Operations This talk illustrates some interesting challenges when Performance inspecting encoded streams of data, especially in a Expression Optimization Performance performance-critical way, and gives some possible Analysis solutions. Implementation and Usage This talk illustrates an interesting application of Erlang Common Use Cases Caveats outside of its normal domain of highly parallel soft

Summary real-time applications, and shows how its strengths make it well-suited for this sort of problem. This tool served as a “back door” for getting Erlang into DVLabs. SUMMARY

SCREAM Using this tool, users can reap several benefits: Rob King Pre-filter encoded data streams, lessening the need for

Introduction expensive decoding. The Problem Protect infrastructure that provides decoding services The Base64 Algorithm but is itself vulnerable Regular Expressions Inspect encoded streams within incurring the overhead The Algorithm Ways of Solving the of decoding. Problem Encoding Operations This talk illustrates some interesting challenges when Performance inspecting encoded streams of data, especially in a Expression Optimization Performance performance-critical way, and gives some possible Analysis solutions. Implementation and Usage This talk illustrates an interesting application of Erlang Common Use Cases Caveats outside of its normal domain of highly parallel soft

Summary real-time applications, and shows how its strengths make it well-suited for this sort of problem. This tool served as a “back door” for getting Erlang into DVLabs. SUMMARY

SCREAM Using this tool, users can reap several benefits: Rob King Pre-filter encoded data streams, lessening the need for

Introduction expensive decoding. The Problem Protect infrastructure that provides decoding services The Base64 Algorithm but is itself vulnerable Regular Expressions Inspect encoded streams within incurring the overhead The Algorithm Ways of Solving the of decoding. Problem Encoding Operations This talk illustrates some interesting challenges when Performance inspecting encoded streams of data, especially in a Expression Optimization Performance performance-critical way, and gives some possible Analysis solutions. Implementation and Usage This talk illustrates an interesting application of Erlang Common Use Cases Caveats outside of its normal domain of highly parallel soft

Summary real-time applications, and shows how its strengths make it well-suited for this sort of problem. This tool served as a “back door” for getting Erlang into DVLabs. SUMMARY Getting Erlang Accepted at DVLabs

SCREAM

Rob King Initially, Erlang suffered from a “not invented here” type Introduction mentality, and a fear that “no one knows Erlang”. The Problem The Base64 Algorithm This tool really couldn’t have been written to run Regular Expressions

The Algorithm efficiently very easily in any other language. Ways of Solving the Problem Thanks to this tool, we’ve encouraged adoption of Encoding Operations Erlang in various other portions of the enterprise: Performance Expression A compiler written in Erlang serves as the first stage of Optimization Performance our build process. Analysis Erlang is now being integrated into various testing tools Implementation and Usage that we use for quality assurance and threat analysis. Common Use Cases Interesting tools used for static analysis of binaries are Caveats

Summary being written in Erlang (maybe something to talk about at Erlang Factory 2011?) SUMMARY Getting Erlang Accepted at DVLabs

SCREAM

Rob King Initially, Erlang suffered from a “not invented here” type Introduction mentality, and a fear that “no one knows Erlang”. The Problem The Base64 Algorithm This tool really couldn’t have been written to run Regular Expressions

The Algorithm efficiently very easily in any other language. Ways of Solving the Problem Thanks to this tool, we’ve encouraged adoption of Encoding Operations Erlang in various other portions of the enterprise: Performance Expression A compiler written in Erlang serves as the first stage of Optimization Performance our build process. Analysis Erlang is now being integrated into various testing tools Implementation and Usage that we use for quality assurance and threat analysis. Common Use Cases Interesting tools used for static analysis of binaries are Caveats

Summary being written in Erlang (maybe something to talk about at Erlang Factory 2011?) SUMMARY Getting Erlang Accepted at DVLabs

SCREAM

Rob King Initially, Erlang suffered from a “not invented here” type Introduction mentality, and a fear that “no one knows Erlang”. The Problem The Base64 Algorithm This tool really couldn’t have been written to run Regular Expressions

The Algorithm efficiently very easily in any other language. Ways of Solving the Problem Thanks to this tool, we’ve encouraged adoption of Encoding Operations Erlang in various other portions of the enterprise: Performance Expression A compiler written in Erlang serves as the first stage of Optimization Performance our build process. Analysis Erlang is now being integrated into various testing tools Implementation and Usage that we use for quality assurance and threat analysis. Common Use Cases Interesting tools used for static analysis of binaries are Caveats

Summary being written in Erlang (maybe something to talk about at Erlang Factory 2011?) SUMMARY Getting Erlang Accepted at DVLabs

SCREAM

Rob King Initially, Erlang suffered from a “not invented here” type Introduction mentality, and a fear that “no one knows Erlang”. The Problem The Base64 Algorithm This tool really couldn’t have been written to run Regular Expressions

The Algorithm efficiently very easily in any other language. Ways of Solving the Problem Thanks to this tool, we’ve encouraged adoption of Encoding Operations Erlang in various other portions of the enterprise: Performance Expression A compiler written in Erlang serves as the first stage of Optimization Performance our build process. Analysis Erlang is now being integrated into various testing tools Implementation and Usage that we use for quality assurance and threat analysis. Common Use Cases Interesting tools used for static analysis of binaries are Caveats

Summary being written in Erlang (maybe something to talk about at Erlang Factory 2011?) SUMMARY Getting Erlang Accepted at DVLabs

SCREAM

Rob King Initially, Erlang suffered from a “not invented here” type Introduction mentality, and a fear that “no one knows Erlang”. The Problem The Base64 Algorithm This tool really couldn’t have been written to run Regular Expressions

The Algorithm efficiently very easily in any other language. Ways of Solving the Problem Thanks to this tool, we’ve encouraged adoption of Encoding Operations Erlang in various other portions of the enterprise: Performance Expression A compiler written in Erlang serves as the first stage of Optimization Performance our build process. Analysis Erlang is now being integrated into various testing tools Implementation and Usage that we use for quality assurance and threat analysis. Common Use Cases Interesting tools used for static analysis of binaries are Caveats

Summary being written in Erlang (maybe something to talk about at Erlang Factory 2011?) SUMMARY Getting Erlang Accepted at DVLabs

SCREAM

Rob King Initially, Erlang suffered from a “not invented here” type Introduction mentality, and a fear that “no one knows Erlang”. The Problem The Base64 Algorithm This tool really couldn’t have been written to run Regular Expressions

The Algorithm efficiently very easily in any other language. Ways of Solving the Problem Thanks to this tool, we’ve encouraged adoption of Encoding Operations Erlang in various other portions of the enterprise: Performance Expression A compiler written in Erlang serves as the first stage of Optimization Performance our build process. Analysis Erlang is now being integrated into various testing tools Implementation and Usage that we use for quality assurance and threat analysis. Common Use Cases Interesting tools used for static analysis of binaries are Caveats

Summary being written in Erlang (maybe something to talk about at Erlang Factory 2011?) SUMMARY Getting Erlang Accepted at DVLabs

SCREAM

Rob King Initially, Erlang suffered from a “not invented here” type Introduction mentality, and a fear that “no one knows Erlang”. The Problem The Base64 Algorithm This tool really couldn’t have been written to run Regular Expressions

The Algorithm efficiently very easily in any other language. Ways of Solving the Problem Thanks to this tool, we’ve encouraged adoption of Encoding Operations Erlang in various other portions of the enterprise: Performance Expression A compiler written in Erlang serves as the first stage of Optimization Performance our build process. Analysis Erlang is now being integrated into various testing tools Implementation and Usage that we use for quality assurance and threat analysis. Common Use Cases Interesting tools used for static analysis of binaries are Caveats

Summary being written in Erlang (maybe something to talk about at Erlang Factory 2011?) FIN Contact Information

SCREAM

Rob King

Introduction The Problem The Base64 Algorithm Regular Expressions

The Algorithm Ways of Solving the Rob at Work: [email protected] Problem Encoding Operations Rob at Home: [email protected] Performance Expression Optimization Performance Analysis

Implementation and Usage Common Use Cases Caveats

Summary