
Processing in E-Discovery, a Primer
Craig Ball1
© 2019 All Rights Reserved

Table of Contents

Table of Contents ...... 2
Introduction ...... 4
Why process ESI in e-discovery? Isn’t it “review ready?” ...... 5
Files ...... 6
A Bit About and a Byte Out of Files ...... 6
Digital Encoding ...... 7
Decimal and Binary: Base 10 and Base Two ...... 7
Bits ...... 7
Bytes ...... 8
The Magic Decoder Ring called ASCII ...... 9
Unicode ...... 12
Mind the Gap! ...... 13
Hex ...... 13
Base64 ...... 14
Why Care about Encoding? ...... 15
Hypothetical: TaggedyAnn ...... 16
File Type Identification ...... 16
File Extensions ...... 17
Binary File Signatures ...... 17
File Structure ...... 18
Data Compression ...... 19
Identification on Ingestion ...... 21

1 Adjunct Professor, E-Discovery & Digital Evidence, University of Texas School of Law, Austin, Texas. The author acknowledges the contributions of Diana Ball as editor and of David Sitsky of Nuix for his insights into processing.

Media (MIME) Type Detection ...... 22
Media Type Tree Structure ...... 22
When All Else Fails: Octet Streams and Text Stripping ...... 24
Data Extraction and Document Filters ...... 24
Recursion and Embedded Object Extraction ...... 27
Simple Document ...... 28
Structured Document ...... 28
Compound Document ...... 28
Simple Container ...... 29
Container with Text ...... 29
Family Tracking and Unitization: Keeping Up with the Parts ...... 29
Exceptions Reporting: Keeping Track of Failures ...... 29
Lexical Preprocessing of Extracted Text ...... 30
Normalization ...... 30
Normalization ...... 30
Unicode Normalization ...... 30
Diacritical Normalization ...... 31
Case Normalization ...... 31
Impact of Normalization on Discovery Outcomes ...... 32
Time Zone Normalization ...... 33
Parsing and Tokenization ...... 33
Building a Database and Concordance Index ...... 37
The Database ...... 37
The Concordance Index ...... 38
Culling and Selecting the Dataset ...... 39
Immaterial Item Suppression ...... 39
De-NISTing ...... 40
Cryptographic Hashing ...... 40
Deduplication ...... 45
Near-Deduplication ...... 45
Office 365 Deduplication ...... 49


Relativity E-Mail Deduplication ...... 50
Other Processing Tasks ...... 52
Foreign Language Detection ...... 52
Entropy Testing ...... 52
Decryption ...... 52
Bad Extension Flagging ...... 52
Color Detection ...... 52
Hidden Content Flagging ...... 52
N-Gram and Shingle Generation ...... 53
Optical Character Recognition (OCR) ...... 53
Virus Scanning ...... 53
Production Processing ...... 53
Illustration 1: E-Discovery Processing Model ...... 53
Illustration 2: Anatomy of a Word DOCX File ...... 55

Introduction

Talk to lawyers about e-discovery processing and you’ll likely get a blank stare suggesting no clue what you’re talking about. Why would lawyers want anything to do with something so disagreeably technical? Indeed, processing is technical and strikes attorneys as something they need not know. That’s lamentable because processing is a stage of e-discovery where things can go terribly awry in terms of cost and outcome. Lawyers who understand the fundamentals of ESI processing are better situated to avoid costly mistakes and resolve them when they happen. This paper looks closely at ESI processing and seeks to unravel its mysteries.

Processing is the “black box” between preservation/collection and review/analysis. Though the iconic Electronic Discovery Reference Model (EDRM) positions Processing, Review and Analysis as parallel paths to Production, processing is an essential prerequisite—“the only road”—to Review, Analysis and Production.2 Any way you approach e-discovery at scale, you must process ESI before you can review or analyze it. If we recast the EDRM to reflect processing’s centrality, the famed schematic would look like this:

2 That’s not a flaw. The EDRM is a conceptual view, not a workflow.

There are hundreds—perhaps thousands—of articles delving into the stages of e-discovery that flank processing in the EDRM. These are the stages where lawyers have had a job to do. But lawyers tend to cede processing decisions to technicians. When it comes to processing, lawyer competency and ken are practically non-existent, little more than “stuff goes in, stuff comes out.”

Why process ESI in e-discovery? Isn’t it “review ready?”

We process information in e-discovery to catalog and index contents for search and review. Unlike Google, e-discovery is the search for all responsive information in a collection, not just one information item deemed responsive. Though all electronically stored information is inherently electronically searchable, computers don’t structure or search all ESI in the same way; so, we must process ESI to normalize it and achieve uniformity for indexing and search.

Thus, “processing” in e-discovery could be called “normalized access,” in the sense of extracting content, decoding it and managing and presenting content in consistent ways for access and review. It encompasses the steps required to extract text and metadata from information items and build a searchable index. The closest analogy is the creation of a Google-like capability with respect to a discrete collection of documents, data and metadata.

ESI processing tools perform five common functions.4 They must:

1) Decompress, unpack and fully explore, i.e., recurse ingested items.

4 While a processing tool may do considerably more than the listed functions, a tool that does less is unlikely to meet litigants’ needs in e-discovery.

2) Identify and apply templates (filters) to encoded data to parse (interpret) contents and extract text, embedded objects, and metadata.

3) Track and hash items processed, enumerating and unitizing all items and tracking failures.

4) Normalize and tokenize text and data and create an index and database of extracted information.

5) Cull data by file type, date, lexical content, hash value and other criteria.

These steps are a more detailed description of processing than I’ve seen published; but I mean to go deeper, diving into how it works and exposing situations where it may fail to meet expectations.5

Files

If we polled lawyers asking what to call the electronic items preserved, collected and processed in discovery, most would answer, “documents.” Others might opt for “data” or reflect on the initialism “ESI” and say, “information.” None are wrong answers, but the ideal response would be the rarest: “files.” Electronic documents are files. Electronically stored information resides in files. Everything we deal with digitally in electronic discovery comes from or goes to physical or logical data storage units called “data files” or just “files.” Moreover, all programs run against data files are themselves files comprising instructions for tasks. These are “executable files” or simply “executables.”

So, what is it we process in the processing stage of e-discovery? The answer is, “we process files.” Let’s look at these all-important files and explore what’s in them and how they work.

A Bit About and a Byte Out of Files

A colleague once defended her ignorance of the technical fundamentals of electronically stored information by analogizing that “she didn’t need to know how planes stay aloft to fly on one.” She had a point, but only for passengers. If you aspire to be a pilot or a rocket scientist—if you want to be at the controls or design the aircraft—you must understand the fundamentals of flight. If you aspire to understand the processing of ESI in e-discovery and manage e-discovery, you must understand the fundamentals of electronically stored information, including such topics as:

• What’s stored electronically?

• How is it stored?

5 Though I commend the abandoned efforts of the EDRM to educate via their Processing Standards Guide, still in its public comment phase five+ years after publication. https://www.edrm.net/resources/frameworks-and-standards/edrm-model/edrm-stages-standards/edrm-processing-standards-guide-version-1/

• What forms does it take?

The next few pages are a crash course in ESI storage, particularly the basics of encoding and recording textual information. If you can tough it out, you’ll be undaunted by discussions of “Hex Magic Numbers” and “Unicode Normalization” yet to come.

Digital Encoding

All digital evidence is encoded, and how it’s encoded bears upon how it’s collected and processed, whether it can be searched and in what reasonably usable forms it can be produced. Understanding that electronically stored information is numerically encoded data helps us see the interplay and interchangeability between different forms of digital evidence. Saying “it’s all ones and zeroes” means nothing if you don’t grasp how those ones and zeroes underpin the evidence.

Decimal and Binary: Base 10 and Base Two

When we were children starting to count, we had to learn the decimal system. We had to think about what numbers meant. When our first-grade selves tackled big numbers like 9,465, we were acutely aware that each digit represented a decimal multiple. The nine was in the thousands place, the four in the hundreds, the six in the tens place and the five in the ones. We might even have parsed 9,465 as: (9 x 1000) + (4 x 100) + (6 x 10) + (5 x 1).

But soon, it became second nature to us. We’d unconsciously process 9,465 as nine thousand four hundred sixty-five. As we matured, we learned about powers of ten and now saw 9,465 as: (9 x 10³) + (4 x 10²) + (6 x 10¹) + (5 x 10⁰). This was exponential or “base ten” notation. We flushed it from our adolescent brains as fast as life (and the SAT) allowed.

Humankind probably uses base ten to count because we evolved with ten fingers. But, had we slithered from the ooze with eight or twelve digits, we’d have gotten on splendidly using a base eight or base twelve number system. It really wouldn’t matter because you can express any number—and consequently any data—in any number system. So, it happens that computers use the base two or binary system, and computer programmers are partial to base sixteen or hexadecimal. Data is just numbers, and it’s all just counting. The radix or base is the number of unique digits, including the digit zero, used to represent numbers in a positional numeral system. For the decimal system the radix is ten, because it uses the ten digits, 0 through 9.

Bits

Computers use binary digits in place of decimal digits. The word bit is even a shortening of the words “Binary digIT.” Unlike the decimal system, where we represent any number as a combination of ten possible digits (0-9), the binary system uses only two possible values: zero or one. This is not as limiting as one might expect when you consider that a digital circuit—

essentially an unfathomably complex array of switches—hasn’t got any fingers to count on but is very good and very fast at being “on” or “off.”

In the binary system, each binary digit—each “bit”—holds the value of a power of two. Therefore, a binary number is composed of only zeroes and ones, like this: 10101. How do you figure out what the value of the binary number 10101 is? You do it in the same way we did it above for 9,465, but you use a base of 2 instead of a base of 10. Hence: (1 x 2⁴) + (0 x 2³) + (1 x 2²) + (0 x 2¹) + (1 x 2⁰) = 16 + 0 + 4 + 0 + 1 = 21.

Moving from right to left, each bit you encounter represents the value of increasing powers of 2, standing in for zero, two, four, eight, sixteen, thirty-two, sixty-four and so on. That makes counting in binary easy. Starting at zero and going through 21, decimal and binary equivalents look like the table at right.
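The table of decimal and binary equivalents is not reproduced here, but a few lines of Python (offered purely as an illustrative sketch) generate the same pairs and confirm the arithmetic above:

```python
# Print decimal values 0 through 21 alongside their binary equivalents.
for n in range(22):
    print(f"{n:>2}  {n:>6b}")        # e.g., 21 prints as 10101

# Converting the other way: int() parses a string of bits as base two.
assert int("10101", 2) == (1 * 2**4) + (0 * 2**3) + (1 * 2**2) + (0 * 2**1) + (1 * 2**0) == 21
```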

Bytes

A byte is a string (sequence) of eight bits. The biggest number that can be stored as one byte of information is 11111111, equal to 255 in the decimal system. The smallest number is zero or 00000000. Thus, you can store 256 different numbers as one byte of information. So, what do you do if you need to store a number larger than 255? Simple! You use a second byte. This affords you all the combinations that can be achieved with 16 bits, being the product of all the variations of the first byte and all of the second byte (256 x 256 or 65,536). So, using bytes to express values, we express any number greater than 255 using at least two bytes (called a “word” in geek speak), and any number above 65,535 requires three bytes or more. A value greater than 16,777,215 (256³ being 16,777,216, the same as 2²⁴) needs four bytes (called a “long word”) and so on.

Let’s try it: Suppose we want to represent the number 51,975. It’s 1100101100000111, viz:

Power: 2¹⁵ 2¹⁴ 2¹³ 2¹² 2¹¹ 2¹⁰ 2⁹ 2⁸ | 2⁷ 2⁶ 2⁵ 2⁴ 2³ 2² 2¹ 2⁰
Value: 32768 16384 8192 4096 2048 1024 512 256 | 128 64 32 16 8 4 2 1
Bit: 1 1 0 0 1 0 1 1 | 0 0 0 0 0 1 1 1

The bits set in the first byte contribute 32768 + 16384 + 2048 + 512 + 256, or 51,968; the bits set in the second byte contribute 4 + 2 + 1, or 7. Together, that’s 51,975.
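For readers who want to check the arithmetic, here is a short Python sketch confirming the two-byte example:

```python
# Verify that the 16-bit sequence above really is 51,975.
bits = "1100101100000111"
value = sum(int(bit) * 2**power for power, bit in enumerate(reversed(bits)))
assert value == int(bits, 2) == 51975

# The same number as its two constituent bytes (a 16-bit "word"):
high, low = 0b11001011, 0b00000111      # 203 and 7 in decimal
assert high * 256 + low == 51975
print(value, high, low)                 # 51975 203 7
```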


Why are eight-bit sequences the fundamental building blocks of computing? It just happened that way. In these times of cheap memory, expansive storage and lightning-fast processors, it’s easy to forget how scarce and costly such resources were at the dawn of the computing era. Seven bits (with a leading bit reserved) was the smallest block of data that would suffice to represent the minimum complement of alphabetic characters, decimal digits, punctuation and control instructions needed by the pioneers in computer engineering. It was, in a sense, all the data early processors could bite off at a time, perhaps explaining the name “byte” (coined in 1956 by IBM scientist Dr. Werner Buchholz).

The Magic Decoder Ring called ASCII

Back in 1935, American kids who listened to the Little Orphan Annie radio show and drank lots of Ovaltine could join the Radio Orphan Annie Secret Society and obtain a Magic Decoder Ring, a device with rotating disks that allowed them to read and write numerically-encoded messages.

Similarly, computers encode words as numbers. Binary data stand in for the upper- and lower-case letters of the alphabet, as well as punctuation marks, special characters and machine instructions (like carriage return and line feed). The most widely deployed U.S. encoding mechanism is known as the ASCII code (for American Standard Code for Information Interchange, pronounced “ask-key”). By limiting the ASCII character set to just 128 characters, we can express any character in just seven bits (2⁷ or 128) and so occupy only one byte in the computer's storage and memory. In the Binary-to-ASCII Table below, the columns reflect a binary (byte) value, its decimal equivalent and the corresponding ASCII text value (including some for machine codes and punctuation):

Binary-to-ASCII Table

Binary Decimal Character Binary Decimal Character Binary Decimal Character

00000000 000 NUL 00101011 043 + 01010110 086 V

00000001 001 SOH 00101100 044 , 01010111 087 W

00000010 002 STX 00101101 045 - 01011000 088 X

00000011 003 ETX 00101110 046 . 01011001 089 Y

00000100 004 EOT 00101111 047 / 01011010 090 Z



00000101 005 ENQ 00110000 048 0 01011011 091 [

00000110 006 ACK 00110001 049 1 01011100 092 \

00000111 007 BEL 00110010 050 2 01011101 093 ]

00001000 008 BS 00110011 051 3 01011110 094 ^

00001001 009 HT 00110100 052 4 01011111 095 _

00001010 010 LF 00110101 053 5 01100000 096 `

00001011 011 VT 00110110 054 6 01100001 097 a

00001100 012 FF 00110111 055 7 01100010 098 b

00001101 013 CR 00111000 056 8 01100011 099 c

00001110 014 SO 00111001 057 9 01100100 100 d

00001111 015 SI 00111010 058 : 01100101 101 e

00010000 016 DLE 00111011 059 ; 01100110 102 f

00010001 017 DC1 00111100 060 < 01100111 103 g

00010010 018 DC2 00111101 061 = 01101000 104 h

00010011 019 DC3 00111110 062 > 01101001 105 i

00010100 020 DC4 00111111 063 ? 01101010 106 j

00010101 021 NAK 01000000 064 @ 01101011 107 k

00010110 022 SYN 01000001 065 A 01101100 108 l

00010111 023 ETB 01000010 066 B 01101101 109 m

00011000 024 CAN 01000011 067 C 01101110 110 n

00011001 025 EM 01000100 068 D 01101111 111 o

00011010 026 SUB 01000101 069 E 01110000 112 p

00011011 027 ESC 01000110 070 F 01110001 113 q

00011100 028 FS 01000111 071 G 01110010 114 r

00011101 029 GS 01001000 072 H 01110011 115 s

00011110 030 RS 01001001 073 I 01110100 116 t



00011111 031 US 01001010 074 J 01110101 117 u

00100000 032 SP 01001011 075 K 01110110 118 v

00100001 033 ! 01001100 076 L 01110111 119 w

00100010 034 " 01001101 077 M 01111000 120 x

00100011 035 # 01001110 078 N 01111001 121 y

00100100 036 $ 01001111 079 O 01111010 122 z

00100101 037 % 01010000 080 P 01111011 123 {

00100110 038 & 01010001 081 Q 01111100 124 |

00100111 039 ' 01010010 082 R 01111101 125 }

00101000 040 ( 01010011 083 S 01111110 126 ~

00101001 041 ) 01010100 084 T 01111111 127 DEL

00101010 042 * 01010101 085 U Note: 0-127 is 128 values

So, “E-Discovery” would be written in a binary ASCII sequence as:

01000101 00101101 01000100 01101001 01110011 01100011 01101111 01110110 01100101 01110010 01111001

It would be tough to remember your own name written in this manner! Hi, I’m Craig, but my computer calls me 01000011 01110010 01100001 01101001 01100111.
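A short Python sketch performs the same translation, turning any ASCII string into the binary sequence a computer stores:

```python
# Render a string as binary ASCII: each character becomes one 8-bit byte.
def to_binary_ascii(text: str) -> str:
    return " ".join(f"{byte:08b}" for byte in text.encode("ascii"))

print(to_binary_ascii("E-Discovery"))
# 01000101 00101101 01000100 01101001 01110011 01100011 01101111 01110110 01100101 01110010 01111001
print(to_binary_ascii("Craig"))
# 01000011 01110010 01100001 01101001 01100111
```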

Note that each leading bit of each byte in the table above is a zero. It wasn’t used to convey any encoding information; that is, they are all 7-bit bytes. In time, the eighth bit (the leading zero) would change to a one and was used to encode another 128 characters (2⁸ or 256), leading to various “extended” (or “high”) ASCII sets that include, e.g., accented characters used in foreign languages and line drawing characters.

Unfortunately, these extra characters weren’t assigned in the same way by all computer systems. The emergence of different sets of characters mapped to the same high byte values prompted a need to identify these various character encodings or, as Microsoft calls them in Windows, these “code pages.” If an application used the wrong code page, information displayed as gibberish. This is such a familiar phenomenon that it has its own name, mojibake (from the Japanese for


“character changing”). If you’ve encountered a bunch of Asian language characters in an e-mail or document you know was written in English, you might have glimpsed mojibake.
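You can manufacture mojibake on demand. The sketch below (illustrative only) encodes accented text as UTF-8 bytes and then decodes those same bytes with the wrong code pages:

```python
# Decode the same bytes with the right and the wrong character encodings.
original = "Résumé café"                # text containing accented, non-ASCII characters
data = original.encode("utf-8")         # the bytes as they might sit on disk

print(data.decode("utf-8"))             # Résumé café    (correct encoding)
print(data.decode("cp1252"))            # RÃ©sumÃ© cafÃ©  (classic mojibake)
print(data.decode("latin-1"))           # RÃ©sumÃ© cafÃ©  (same misreading)
```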

Note that we are speaking here of textual information, not typography; so, don’t confuse character encodings with fonts. The former tells you whether the character is an A or b, not whether to display the character in Arial or Baskerville.

In the mid-1980s, international standards began to emerge for character encoding, ultimately resulting in various code sets issued by the International Standards Organization (ISO). These retained the first 128 American ASCII values and assigned the upper 128-byte values to characters suited to various languages (e.g., Cyrillic, Greek, and Hebrew). ISO called these various character sets ISO-8859-n, where the “n” distinguished the sets for different languages. ISO-8859-1 was the set suited to Latin-derived alphabets (like English) and so the nickname for the most familiar code page to U.S. computer users became “Latin 1.”

But Microsoft adopted its code page before the ISO standard became final, basing its Latin 1 encoding on an earlier draft promulgated by the American National Standards Institute (ANSI). Thus, the standard Windows Latin-1 code page, called Windows-1252 (ANSI), is mostly identical to ISO-8859-1, and it’s common to see the two referred to interchangeably as “Latin 1.”

Unicode

ASCII dawned in the pre-Internet world of 1963—before the world was flat, when the West dominated commerce and personal computing was the stuff of science fiction. Using a single byte (even with various code pages) supported only 256 characters, so it was incompatible with Asian languages like Chinese, Japanese and Korean employing thousands of pictograms and ideograms.

Though programmers developed various ad hoc approaches to foreign language encodings, we needed a universal, systematic encoding mechanism to serve an increasingly interconnected world. These methods would use more than one byte to represent each character. The most widely adopted such system is Unicode. In its latest incarnation (version 12.1), Unicode standardizes the encoding of 150 written languages called “scripts” comprising 127,994 characters, plus multiple symbol sets and emoji.

The Unicode Consortium crafted Unicode to co-exist with the longstanding ASCII and ANSI character sets by emulating the ASCII character set in corresponding byte values within the more extensible Unicode counterpart, UTF-8. Because of its backward compatibility and multilingual adaptability, UTF-8 has become a popular encoding standard, especially on the Internet and within e-mail systems.
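A quick demonstration of that backward compatibility, sketched in Python (3.8 or later for the hex() separator):

```python
# UTF-8 reproduces 7-bit ASCII byte for byte, then uses multi-byte
# sequences for everything else.
assert "E-Discovery".encode("utf-8") == "E-Discovery".encode("ascii")

for ch in "Aé中😀":
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(" "), f"- {len(encoded)} byte(s)")
# A 41 - 1 byte(s)
# é c3 a9 - 2 byte(s)
# 中 e4 b8 ad - 3 byte(s)
# 😀 f0 9f 98 80 - 4 byte(s)
```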


Mind the Gap!

Now, as we talk about all these bytes and encoding standards as a precursor to hexadecimal notation, it will be helpful to revisit how this all fits together. A byte is eight ones or zeroes, which means a byte can represent 256 different decimal numbers from 0-255. So, two bytes can represent a much bigger range of decimal values (256 x 256 or 65,536). Character encodings (aka “code pages”) like Latin 1 and UTF-8 are ways to map textual, graphical or machine instructions to numeric values expressed as bytes, enabling machines to store and communicate information in human languages. As we move forward, keep in mind that hex, like binary and decimal, is just another way to write numbers. Hex is not a code page, although the numeric values it represents may correspond to values within code pages.6

Hex

Long sequences of ones and zeroes are very confusing for people, so hexadecimal notation emerged as more accessible shorthand for binary sequences. Considering the prior discussion of base 10 (decimal) and base 2 (binary) notation, it might be enough to say that hexadecimal is base 16. In hexadecimal notation (hex for short), each digit can be any value from zero to fifteen. Accordingly, we can replace four binary digits with just a single hexadecimal digit, and more to the point, we can express a byte as just two hex characters.

The decimal system supplies only 10 symbols (0-9) to represent numbers. Hexadecimal notation demands 16 symbols, leaving us without enough single character numeric values to stand in for all the values in each column. So, how do we cram 16 values into each column? The solution was to substitute the letters A through F for the numbers 10 through 15. So, we can represent 10110101 (the decimal number 181) as "B5" in hexadecimal notation. Using hex, we can notate values from 0-255 as 00 to FF (using either lower- or upper-case letters; it doesn’t matter).

It’s hard to tell if a number is decimal or hexadecimal just by looking at it: if you see "37", does that equate to 37 ("37" in decimal) or a decimal 55 ("37" in hexadecimal)? To get around this problem, two common notations are used to indicate hexadecimal numbers. The first is the suffix of a lower-case "h." The second is the prefix of "0x." So "37 in hexadecimal," "37h" and "0x37" all mean the same thing.

6 Don’t be put off by the math. The biggest impediment to getting through the encoding basics is the voice in your head screaming, “I SHOULDN’T HAVE TO KNOW ANY OF THIS!!” Ignore that voice. It’s wrong.

The ASCII Code Chart at right can be used to express ASCII characters in hex. The capital letter “G” is encoded as the hex value of 47 (i.e., row 4, column 7), so “E-Discovery” in hex encodes as:

0x45 2D 44 69 73 63 6F 76 65 72 79

That’s easier than:

01000101 00101101 01000100 01101001 01110011 01100011 01101111 01110110 01100101 01110010 01111001?
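A Python sketch (again, purely illustrative) shows the same string in hex and in binary and confirms that hex is just a compact way of writing the same numbers:

```python
# The string "E-Discovery" three ways: characters, hex bytes and bits.
text = "E-Discovery"
data = text.encode("ascii")

print(data.hex(" ").upper())                  # 45 2D 44 69 73 63 6F 76 65 72 79
print(" ".join(f"{b:08b}" for b in data))     # the 88-bit version shown above

# Hex is only notation: 0x47 and 71 are the same number, and chr() maps
# an ASCII value back to its character.
assert 0x47 == 71 and chr(0x47) == "G"
assert int("37", 16) == 55                    # "37h" means decimal 55
```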

Base64

Internet e-mail was born in 1971, when a researcher named Ray Tomlinson sent a message to himself using the “@” sign to distinguish the addressee from the machine. Tomlinson didn’t remember the message transmitted in that historic first e-mail but speculated that it was probably something like “qwertyuiop.” So, not exactly, “Mr. Watson, come here. I need you,” but then, Tomlinson didn’t know he was changing the world. He was just killing time.

Also, back when the nascent Internet consisted of just four university research computers, UCLA student Stephen Crocker originated the practice of circulating proposed technical standards (or “protocols” in geek speak) as publications called “Requests for Comments” or RFCs. They went via U.S. postal mail because there was no such thing as e-mail. Ever after, proposed standards establishing the format of e-mail were promulgated as numbered RFCs. So, when you hear an e-discovery vendor mention “RFC5322 content,” fear not, it just means plain ol’ e-mail.

An e-mail is as simple as a postcard. Like the back-left side of a postcard, an e-mail has an area called the message body reserved for the user's text message. Like a postcard's back right side, we devote another area called the message header to information needed to get the card where it's supposed to go and to transmittal data akin to a postmark.

We can liken the picture or drawing on the front of our postcard to an e-mail’s attachment. Unlike a postcard, we must convert e-mail attachments to letters and numbers for transmission, enabling an e-mail to carry any type of electronic data — audio, documents, video and more — not just pretty pictures.

7 The concept of byte order called Endianness warrants passing mention here. Computers store data in ways compatible with their digital memory registers. In some systems, the most significant digit (byte) comes first and in others, it’s last. This byte ordering determines whether a hexadecimal value is written as 37 or 73. A big-endian ordering places the most significant byte first and the least significant byte last, while a little-endian ordering does the opposite. Consider the hex number 0x1234, which requires at least two bytes to represent. In a big-endian ordering they would be [0x12, 0x34], while in a little-endian ordering, the bytes are arranged [0x34, 0x12].
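The byte-order point in the footnote can be made concrete with Python's struct module; a minimal sketch of the 0x1234 example:

```python
import struct

# Pack the 16-bit value 0x1234 in big-endian and little-endian byte order.
big    = struct.pack(">H", 0x1234)      # b'\x12\x34' - most significant byte first
little = struct.pack("<H", 0x1234)      # b'\x34\x12' - least significant byte first

print(big.hex(" "), "|", little.hex(" "))   # 12 34 | 34 12
assert struct.unpack(">H", big)[0] == struct.unpack("<H", little)[0] == 0x1234
```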

The key point is that everything in any e-mail is plain text, no matter what’s attached.

And by plain text, I mean the plainest English text, 7-bit ASCII, lacking even the diacritical characters required for accented words in French or Spanish or any formatting capability. No bold. No underline. No italics. It is text so simple that you can store a letter as a single byte of data.

The dogged adherence to plain English text stems in part from the universal use of the Simple Mail Transfer Protocol or SMTP to transmit e-mail. SMTP only supports 7-bit ASCII characters, so sticking with SMTP maintained compatibility with older, simpler systems. Because it’s just text, it’s compatible with any e-mail system invented in the last 50 years. Think about that the next time you come across a floppy disk or CD and wonder how you’re going to read it.

How do you encode a world of complex digital content into plain text without losing anything?

The answer is an encoding scheme called Base64, which substitutes 64 printable ASCII characters (A–Z, a–z, 0–9, + and /) for any binary data or for foreign characters, like Cyrillic or Chinese, that can be represented in the Unicode standard.

Base64 is brilliant and amazingly simple. Since all digital data is stored as bits, and six bits can be arranged in 64 separate ways, you need just 64 alphanumeric characters to stand in for any six bits of data. The 26 lower case letters, 26 upper case letters and the numbers 0-9 give you 62 stand-ins. Throw in a couple of punctuation marks—say the forward slash and plus sign—and you have all the printable ASCII characters you need to represent any binary content in six bit chunks. Though the encoded data takes up roughly a third more space than its binary source, now any mail system can hand off any attachment. Once again, it’s all just numbers.
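Here is a minimal sketch using Python's standard base64 module; it encodes 256 bytes of arbitrary binary content, shows the roughly one-third growth in size and confirms the round trip is lossless:

```python
import base64

payload = bytes(range(256))                  # arbitrary binary content, 256 bytes
encoded = base64.b64encode(payload)          # printable ASCII characters only

print(len(payload), "bytes in,", len(encoded), "characters out")   # 256 in, 344 out
print(encoded[:24])                          # b'AAECAwQFBgcICQoLDA0ODxAR'
assert base64.b64decode(encoded) == payload  # decoding restores every byte exactly
```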

Why Care about Encoding?

All this code page and Unicode stuff matters because of the vital role that electronic search plays in e-discovery. If we index an information item for search using the wrong character set, the information in the item won’t be extracted, won’t become part of the index and, accordingly, won’t be found even when the correct search terms are run. Many older search tools and electronic records are not Unicode-compliant. Worse, because the content may include a smattering of ASCII text, an older search tool may conclude it’s encountered a document encoded with Latin-1 and may fail to flag the item as having failed to index. In short, encoding issues carry real-world consequences, particularly when foreign language content is in the collection. If you aren’t aware of the limits of the processing tools you, your opponents or service providers use, you can’t negotiate successful protocols. What you don’t know can hurt you.

The many ways in which data is encoded—and the diverse ways in which collection, processing and search tools identify the multifarious encoding schemes—lie at the heart of significant challenges and costly errors in e-discovery. It’s easy to dismiss these fundamentals of

information technology as too removed from litigation to be worth the effort to explore them; but, understanding encoding and the role it plays in processing, indexing and search will help you realize the capabilities and limits of the tools you, your clients, vendors and opponents use.

Let’s look at these issues and file formats through the lens of a programmer making decisions about how a file should function.

Hypothetical: TaggedyAnn

Ann, a programmer, must create a tool to generate nametags for attendees at an upcoming continuing legal education program. Ann wants her tool to support different stock and multiple printers, fonts, font sizes and salutations. It will accept lists of pre-registrants as well as create name tags for those who register onsite. Ann will call her tool TaggedyAnn. TaggedyAnn has never existed, so no computer “knows” what to do with TaggedyAnn data. Ann must design the necessary operations and associations.

Remembering our mantra, “ESI is just numbers,” Ann must lay out those numbers in the data files so that the correct program—her TaggedyAnn program—will be executed against TaggedyAnn data files, and she must structure the data files so that the proper series of numbers will be pulled from the proper locations and interpreted in the intended way. Ann must develop the file format.

A file format establishes the way to encode and order data for storage in the file. Ideally, Ann will document her file format as a published specification, a blueprint for the file’s construction. That way, anyone trying to develop applications to work with Ann’s files will know what goes where and how to interpret the data. But Ann may never get around to writing a formal file specification or, if she believes her file format to be a trade secret, Ann may decide not to reveal its structure publicly. In that event, others seeking to read TaggedyAnn files must reverse engineer the files to deduce their structure and fashion a document filter that can accurately parse and view the data.

In terms of complexity, file format specifications run the gamut. Ann’s format will likely be simple—perhaps just a page or two—as TaggedyAnn supports few features and functions. By contrast, the published file specifications for the binary forms of Microsoft Excel, PowerPoint and Word files run to 1,124, 649 and 575 pages, respectively. Microsoft’s latest file format specification for the .PST file that stores Outlook e-mail occupies 192 pages. Imagine trying to reverse engineer such complexity without a published specification! For a peek into the hidden structure of a Word .DOCX file, see Illustration 2: Anatomy of a Word DOCX File.

File Type Identification

In the context of the operating system, Ann must link her program and its data files through various means of file type identification.


File type identification using binary file signatures and file extensions is an essential early step in e-discovery processing. Determining file types is a necessary precursor to applying the appropriate document filter to extract contents and metadata for populating a database and indexing files for use in an e-discovery review platform.

File Extensions

Because Ann is coding her tool from scratch and her data files need only be intelligible to her program, she can structure the files any way she wishes, but she must nevertheless supply a way that computer operating systems can pair her data files with only her executable program. Else, there’s no telling what program might run. Word can open a PDF file, but the result may be unintelligible.

Filename extensions or just “file extensions” are a means to identify the contents and purpose of a data file. Though executable files must carry the file extension .EXE, any ancillary files Ann creates can employ almost any three-letter or -number file extension Ann desires so long as her choice doesn’t conflict with one of the thousands of file extensions already in use. That means that Ann can’t use .TGA because it’s already associated with Targa graphics files, and she can’t use .TAG because that signals a DataFlex data file. So, Ann settles on .TGN as the file extension for TaggedyAnn files. That way, when a user loads a data file with the .TGN extension, Windows can direct its contents to the TaggedyAnn application and not to another program that won’t correctly parse the data.

Binary File Signatures

But Ann needs to do more. Not all operating systems employ file extensions, and because extensions can be absent or altered, file extensions aren’t the most reliable way to denote the purpose or content of a file. Too, file names and extensions are system metadata, so they are not stored inside the file. Instead, computers store file extensions within the file table of the storage medium’s file system. Accordingly, Ann must fashion a unique binary header signature that precedes the contents of each TaggedyAnn data file in order that the contents are identifiable as TaggedyAnn data and directed solely to the TaggedyAnn program for execution.

A binary file signature (also called a magic number) will typically occupy the first few bytes of a file’s contents. It will always be hexadecimal values; however, the hex values chosen may correspond to an intelligible alphanumeric (ASCII) sequence. For example, PDF document files begin 25 50 44 46 2D, which is %PDF- in ASCII. Files holding Microsoft tape archive data begin 54 41 50 45, translating to TAPE in ASCII. Sometimes, it’s a convoluted correlation, e.g., Adobe Shockwave Flash files have a file extension of SWF, but their file signature is 46 57 53, corresponding to FWS in ASCII. In other instances, there is no intent to generate a meaningful ASCII word or phrase using the binary header signatures; they are simply numbers in hex. For

example, the header signature for a JPG image is FF D8 FF E0, which translates to the gibberish ÿØÿà in ASCII.

Binary file signatures often append characters that signal variants of the file type or versions of the associated software. For example, JPG images in the JPEG File Interchange Format (JFIF) use the binary signature FF D8 FF E0 00 10 4A 46 49 46 or ÿØÿà JFIF in ASCII. A JPG image in the JPEG Exchangeable Image File Format (Exif, commonly used by digital cameras) will start with the binary signature FF D8 FF E1 00 10 45 78 69 66 or ÿØÿá Exif in ASCII.

Ann peruses the lists of file signatures available online and initially thinks she will use Hex 54 41 47 21 as her binary header signature because no one else appears to be using that signature and it corresponds to TAG! in ASCII.

But, after reflecting on the volume of highly-compressible text that will comprise the program’s data files, Ann instead decides to store the contents in a ZIP-compressed format, necessitating that the binary header signature for TaggedyAnn’s data files be 50 4B 03 04, PK.. in ASCII. All files compressed with ZIP use the PK.. header signature because Phil Katz, the programmer who wrote the ZIP compression tool, chose to flag his file format with his initials.

How can Ann ensure that only TaggedyAnn opens these files and they are not misdirected to an unzipping (decompression) program? Simple. Ann will use the .TGN extension to prompt Windows to load the file in TaggedyAnn and TaggedyAnn will then read the PK.. signature and unzip the contents to access the compressed contents.

But, what about when an e-discovery service provider processes the TaggedyAnn files with the mismatched file extensions and binary signatures? Won’t that throw things off? It could. We’ll return to that issue when we cover compound files and recursion. First, let’s touch on file structures.

File Structure

Now, Ann must decide how she will structure the data within her files. Once more, Ann’s files need only be intelligible to her application, so she is unconstrained in her architecture. This point becomes important as the differences in file structure are what make processing and indexing essential to electronic search. There are thousands of different file types, each structured in idiosyncratic ways. Processing normalizes their contents to a common searchable format.

Some programmers dump data into files in so consistent a way that programs retrieve particular data by offset addressing; that is, by beginning retrieval at a specified number of bytes from the

start of the file (offset from the start) and retrieving a specified extent of data from that offset (i.e., grabbing X bytes following the specified offset).

Offset addressing could make it hard for Ann to add new options and features, so she may prefer to implement a chunk- or directory-based structure. In the first approach, data is labeled within the file to indicate its beginning and end or it may be tagged (“marked up”) for identification. The program accessing the data simply traverses the file seeking the tagged data it requires and grabs the data between tags. There are many ways to implement a chunk-based structure, and it’s probably the most common file structure. A directory-based approach constructs a file as a small operating environment. The directory keeps track of what’s in the file, what it’s called and where it begins and ends. Examples of directory-based formats are ZIP archive files and Microsoft Office files after Office 2007. Ann elects to use a mix of both approaches. Using ZIP entails a directory and folder structure, and she will use tagged, chunked data within the compressed folder and file hierarchy.
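To make offset addressing concrete, here is a sketch that reads fields from a made-up, fixed-layout record. The layout is hypothetical and is not Ann's actual format, which, as described above, uses a ZIP container and tagged chunks:

```python
import struct

# A hypothetical fixed-layout record: a 4-byte signature, a 2-byte version,
# then a 4-byte little-endian offset pointing to where the payload begins.
raw = b"TGN!" + struct.pack("<H", 1) + struct.pack("<I", 10) + b"name tag data"

magic   = raw[0:4]                               # bytes 0-3: file signature
version = struct.unpack_from("<H", raw, 4)[0]    # bytes 4-5: version number
offset  = struct.unpack_from("<I", raw, 6)[0]    # bytes 6-9: offset to the payload
payload = raw[offset:]                           # grab the data starting at that offset

print(magic, version, offset, payload)           # b'TGN!' 1 10 b'name tag data'
```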

Data Compression

Many common file formats and containers are compressed, necessitating that e-discovery processing tools be able to identify compressed files and apply the correct decompression algorithm to extract contents.

Compression is miraculous. It makes modern digital life possible. Without compression, we wouldn’t have smart phones, digitized music, streaming video or digital photography. Without compression, the web would be a different, duller place.

Compression uses algorithms to reduce the space required to store and the bandwidth required to transmit electronic information. If the algorithm preserves all compressed data, it’s termed “lossless compression.” If the algorithm jettisons data deemed expendable, it’s termed “lossy compression.”

JPEG image compression is lossy compression, executing a tradeoff between image quality and file size. Not all the original photo’s graphical information survives JPEG compression. Sharpness and color depth are sacrificed, and compression introduces distortion in the form of rough margins called “jaggies.” We likewise see a loss of fidelity when audio or video data is compressed for storage or transmission (e.g., as MPEG or MP3 files). The offsetting benefit is that the smaller file sizes facilitate streaming video and storing thousands of songs in your pocket.

In contrast, file compression algorithms like ZIP are lossless; decompressed files perfectly match their counterparts before compression. Let’s explore how that’s done.


One simple approach is Run-Length Encoding. It works especially well for images containing consecutive, identical data elements, like the ample white space of a fax transmission. Consider a black and white graphic where each pixel is either B or W; for example, the image of the uppercase letter “E,” below:

The image requires 45 characters:

WWBBBBBWW
WWBBWWWWW
WWBBBBWWW
WWBBWWWWW
WWBBBBBWW

But we can write it in 15 fewer characters by adding a number describing each sequence or “run,” i.e., 2 white pixels, 5 black, 2 white:

2W5B2W
2W2B5W
2W4B3W
2W2B5W
2W5B2W

We’ve compressed the data by a third. Refining our run-length compression, we substitute a symbol (|) for each 2W and now need just 23 characters to describe the graphic, like so:

|5B|
|2B5W
|4B3W
|2B5W
|5B|

We’ve compressed the data by almost half but added overhead: we must now supply a dictionary defining | = 2W. Going a step further, we swap in symbols for 5B (\), 2B (/) and 5W (~), like so:

|\|
|/~
|4B3W
|/~
|\|

It takes just 17 characters to serve as a compressed version of the original 45, at the cost of adding three more symbols to our growing dictionary.

As we apply this run-length encoding to more and more data, we see improved compression ratios because we can apply symbols already in use and don’t need to keep adding new symbols to our dictionary for each swap.
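Run-length encoding is simple enough to fit in a few lines of Python; this sketch reproduces the letter “E” example above:

```python
from itertools import groupby

def rle_encode(pixels: str) -> str:
    """Run-length encode a row of pixels, e.g., 'WWBBBBBWW' -> '2W5B2W'."""
    return "".join(f"{len(list(run))}{value}" for value, run in groupby(pixels))

letter_e = ["WWBBBBBWW", "WWBBWWWWW", "WWBBBBWWW", "WWBBWWWWW", "WWBBBBBWW"]
for row in letter_e:
    print(row, "->", rle_encode(row))
# WWBBBBBWW -> 2W5B2W
# WWBBWWWWW -> 2W2B5W
# WWBBBBWWW -> 2W4B3W
# WWBBWWWWW -> 2W2B5W
# WWBBBBBWW -> 2W5B2W
```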

ZIP employs a lossless compression algorithm called DEFLATE, which came into use in 1993. DEFLATE underpins both ZIP archives and the PNG image format; and thanks to its efficiency and being free to use without license fees, DEFLATE remains the most widely used compression algorithm in the world.

DEFLATE combines two compression techniques, Huffman Coding and a dictionary technique called Lempel–Ziv–Storer–Szymanski (LZSS). Huffman Coding analyzes data to determine frequency of occurrence, generating a probability tree supporting assignment of the shortest symbols to the most recurrences. LZSS implements pointers to prior occurrences (back-

references). Accordingly, DEFLATE achieves compression by the assignment of symbols based on frequency of use and by the matching and replacement of duplicate strings with pointers.

Tools for processing files in e-discovery must identify compressed files and apply the correct algorithm to unpack the contents. The decompression algorithm locates the tree and symbols library and retrieves and parses the directory structure and stored metadata for the contents.
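Python's zlib module wraps the DEFLATE algorithm, so a few lines suffice to see lossless compression at work (an illustrative sketch, not how any particular e-discovery tool does it):

```python
import zlib

text = ("Run-length encoding works well on repetitive data, and DEFLATE does "
        "better still by combining Huffman coding with back-references. ") * 50

compressed = zlib.compress(text.encode("utf-8"))       # zlib implements DEFLATE
restored   = zlib.decompress(compressed).decode("utf-8")

print(len(text), "characters ->", len(compressed), "compressed bytes")
assert restored == text   # lossless: the decompressed output matches the original exactly
```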

Identification on Ingestion

Remember Programmer Ann and her struggle to select a TaggedyAnn file extension and signature? Now, those decisions play a vital role in how an e-discovery processing tool extracts text and metadata. If we hope to pull intelligible data from a file or container, we must first reliably ascertain the file’s structure and encoding. For compressed files, we must apply the proper decompression algorithm. If it’s an e-mail container format, we must apply the proper encoding schema to segregate messages and decode all manner of attachments. We must treat image files as images, sound files as sounds and so on. Misidentification of file types guarantees failure.

The e-discovery industry relies upon various open source and commercial file identifier tools. These tend to look first to the file’s binary header for file identification and then to the file’s extension and name. If the file type cannot be determined from the signature and metadata, the identification tool may either flag the file as unknown (an “exception”) or pursue other identification methods as varied as byte frequency analysis (BFA) or the use of specialty filetype detection tools designed to suss out characteristics unique to certain file types, especially container files. Identifiers will typically report both the apparent file type (from metadata, i.e., the file’s name and extension) and the actual file type. Inconsistencies between these may prompt special handling or signal spoofing with malicious intent.

The table below sets out header signatures aka “magic numbers” for common file types:

ZIP Archive (.ZIP): hex 50 4B 03 04, ASCII “PK..”
MS Office (.DOCX, .XLSX, .PPTX): hex 50 4B 03 04, ASCII “PK..” (compressed XML files)
Outlook mail (.PST): hex 21 42 44 4E, ASCII “!BDN”
Outlook message (.MSG): hex D0 CF 11 E0 A1 B1 1A E1
Executable file (.EXE): hex 4D 5A, ASCII “MZ” (for Mark Zbikowski, who said, “when you're writing the linker, you get to make the rules.”)


Adobe PDF (.PDF): hex 25 50 44 46, ASCII “%PDF”
PNG Graphic (.PNG): hex 89 50 4E 47, ASCII “.PNG”
VMWare Disk file (.VMDK): hex 4B 44 4D 56, ASCII “KDMV”
WAV audio file (.WAV): hex 52 49 46 46, ASCII “RIFF”
Plain text (.TXT): none (only binary file formats carry signatures)
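A bare-bones version of signature-based identification can be sketched in a few lines of Python. Real identifier tools consult far larger signature libraries and fall back on other techniques, but the core idea looks like this (the file path is hypothetical):

```python
# Read the first bytes of a file and compare them to known "magic numbers."
SIGNATURES = {
    b"\x50\x4B\x03\x04": "ZIP archive (also DOCX/XLSX/PPTX)",
    b"\x25\x50\x44\x46": "Adobe PDF",
    b"\x89\x50\x4E\x47": "PNG image",
    b"\xFF\xD8\xFF":     "JPEG image",
    b"\xD0\xCF\x11\xE0": "OLE compound file (e.g., MSG, legacy Office)",
}

def identify(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(8)
    for magic, filetype in SIGNATURES.items():
        if header.startswith(magic):
            return filetype
    return "unknown (octet stream)"

# Compare the signature-based answer with what the extension claims; a
# mismatch may prompt special handling or signal spoofing.
print(identify("sample.docx"))      # hypothetical file path
```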

Media (MIME) Type Detection

File extensions are largely unique to Microsoft operating systems. Systems like Linux and Mac OS X don’t rely on file extensions to identify file types. Instead, they employ a file identification mechanism called Media (MIME) Type Detection. MIME, which stands for Multipurpose Internet Mail Extensions, is a seminal Internet standard that enables the grafting of text enhancements, foreign language character sets (Unicode) and multimedia content (e.g., photos, video, sounds and machine code) onto plain text e-mails. Virtually all e-mail travels in MIME format.

The ability to transmit multiple file types via e-mail created a need to identify the content type transmitted. The Internet Assigned Numbers Authority (IANA) oversees global Internet addressing and defines the hierarchy of media type designation. These hierarchical designations for e-mail attachments conforming to MIME encoding came to be known as MIME types. Though the use of MIME types started with e-mail, operating systems, tools and web browsers now employ MIME types to identify media, prompting the IANA to change the official name from MIME Types to Media Types.

Media types serve two important roles in e-discovery processing. First, they facilitate the identification of content based upon a media type declaration found in, e.g., e-mail attachments and Internet publications. Second, media types serve as standard classifiers for files after identification of content. Classification of files within a media type taxonomy simplifies culling and filtering data in ways useful to e-discovery. While it’s enough to specify “document,” “spreadsheet” or “picture” in a Request for Production, e-discovery tools require a more granular breakdown of content. Tools must be able to distinguish a Word binary .DOC format from a Word XML .DOCX format, a Windows PowerPoint file from a Mac Keynote file and a GIF from a TIFF.
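Python's standard mimetypes module illustrates extension-based media type guessing; note that it looks only at the file name, not the contents, which is why signature checks remain important:

```python
import mimetypes

# Guess media types from file names alone (no examination of content).
for name in ["complaint.pdf", "exhibit.png", "mailbox.zip", "notes.txt"]:
    print(name, "->", mimetypes.guess_type(name)[0])
# complaint.pdf -> application/pdf
# exhibit.png -> image/png
# mailbox.zip -> application/zip
# notes.txt -> text/plain
```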

Media Type Tree Structure

Media types follow a path-like tree structure under one of the following standard types: application, audio, image, text and video (collectively called discrete media types) and message

and multipart (called composite media types). These top-level media types are further defined by subtype and, optionally, by a suffix and parameter(s), written in lowercase.

Examples of file type declarations for common file formats follow. (Note: File types prefixed by x- are not registered with IANA; those prefixed by vnd. are vendor-specific formats.)

Application
Word .DOC: application/msword
Word .DOCX: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Adobe PDF: application/pdf (.pdf)
PowerPoint .PPT: application/vnd.ms-powerpoint
PowerPoint .PPTX: application/vnd.openxmlformats-officedocument.presentationml.presentation
Slack file: application/x-atomist-slack-file+json
.TAR archive: application/x-tar
Excel .XLS: application/vnd.ms-excel
Excel .XLSX: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
ZIP archive .ZIP: application/zip

Audio
.MID: audio/x-midi
.MP3: audio/mpeg
.MP4: audio/mp4
.WAV: audio/x-wav

Image
.BMP: image/bmp
.GIF: image/gif
.JPG: image/jpeg
.PNG: image/png
.TIF: image/tiff

Text (typically accompanied by a charset parameter identifying the character encoding)
.CSS: text/css
.CSV: text/csv
.HTML: text/html
.ICS: text/calendar


.RTF: text/richtext
.TXT: text/plain

Video
.AVI: video/x-msvideo
.MOV: video/quicktime
.MP4: video/mp4
.MPG: video/mpeg

When All Else Fails: Octet Streams and Text Stripping

When a processing tool cannot identify a file, it may flag the file as an exception and discontinue processing contents; but, the greater likelihood is that the tool will treat the unidentifiable file as an octet stream and harvest or “strip out” whatever text or metadata it can identify within the stream. An octet stream is simply an arbitrary sequence or “stream” of data presumed to be binary data stored as eight-bit bytes or “octets.” So, an octet stream is anything the processor fails to recognize as a known file type.

In e-discovery, the risk of treating a file as an octet stream and stripping identifiable text is that the file’s content is likely encoded, such that whatever plain text is stripped and indexed doesn’t fairly mirror relevant content. However, because some text was stripped, the file may not be flagged as an exception requiring special handling; instead, the processing tool records the file as successfully processed notwithstanding the missing content.

Data Extraction and Document Filters

If ESI were like paper, you could open each item in its associated program (its native application), review the contents and decide whether the item is relevant or privileged. But, ESI is much different than paper documents in crucial ways:

• ESI collections tend to be exponentially more voluminous than paper collections;

• ESI is stored digitally, rendering it unintelligible absent electronic processing;

• ESI carries metainformation that is always of practical use and may be probative evidence;

• ESI is electronically searchable while paper documents require laborious human scrutiny;

• ESI is readily culled, filtered and deduplicated, and inexpensively stored and transmitted;

• ESI and associated metadata change when opened in native applications;

These and other differences make it impractical and risky to approach e-discovery via the piecemeal use of native applications as viewers. Search would be inconsistent and slow, and

deduplication impossible. Too, you’d surrender all the benefits of mass tagging, filtering and production. Lawyers who learn that native productions are superior to other forms of production may mistakenly conclude that native production suggests use of native applications for review. Absolutely not! Native applications are not suited to e-discovery, and you shouldn’t use them for review. E-discovery review tools are the only way to go.

To secure the greatest benefit of ESI in search, culling and review, we process ingested files to extract their text, embedded objects, and metadata. In turn, we normalize and tokenize extracted contents, add them to a database and index them for efficient search. These processing operations promote efficiency but impose penalties in terms of reduced precision and accuracy. It’s a tradeoff demanding an informed and purposeful balancing of benefits and costs.

Returning to Programmer Ann and her efforts to fashion a new file format, Ann had a free hand in establishing the structural layout of her TaggedyAnn data files because she was also writing the software to read them. The ability to edit data easily is a hallmark of computing; so, programmers design files to be able to grow and shrink without impairing updating and retrieval of their contents. Files hold text, rich media (like graphics and video), formatting information, configuration instructions, metadata and more. All that disparate content exists as a sequence of hexadecimal characters. Some of it may reside at fixed offset addresses measured in a static number of bytes from the start of the file. But because files must be able to grow and shrink, fixed offset addressing alone won’t cut it. Instead, files must supply dynamic directories of their contents or incorporate tags that serve as signposts for navigation.

When navigating files to extract contents, it’s not enough to know where the data starts and ends, you must also know how the data’s encoded. Is it ASCII text? Unicode? JPEG? Is it a date and time value or perhaps a bit flag where the numeric value serves to signal a characteristic or configuration to the program?

There are two broad approaches used by processing tools to extract content from files. One is to use the Application Programming Interface (API) of the application that created the file. The other is to turn to a published file specification or reverse engineer the file to determine where the data sought to be extracted resides and how it’s encoded.

A software API allows “client” applications (i.e., other software) to make requests or “calls” to the API “server” to obtain specific information and to ask the server to perform specific functions. Much like a restaurant, the client can “order” from a menu of supported API offerings without knowing what goes on in the kitchen, where the client generally isn’t welcome to enter. Like a restaurant with off-menu items, the API may support undocumented calls intended for a limited pool of users.


For online data reposing in sites like Office 365, Dropbox or OneDrive, there’s little choice but to use an API to get to the data; but for data in local files, using a native application’s API is something of a last resort because APIs tend to be slow and constraining. Not all applications offer open APIs, and those that do won’t necessarily hand off all data needed for e-discovery. For many years, a leading e-discovery processing tool required purchasers to obtain a “bootleg” copy of the IBM/Lotus Notes mail program because the secure structure of Notes files frustrated efforts to extract messages and attachments by any means but the native API.

An alternative to the native application API is the use of data extraction templates called Document Filters. Document filters lay out where content is stored within each filetype and how that content is encoded and interpreted. Think of them as an extraction template. Document filters can be based on a published file specification or they can be painstakingly reverse engineered from examples of the data—a terrifically complex process that produces outcomes of questionable accuracy and consistency. Because document filters are challenging to construct and keep up to date for each of the hundreds of file types seen in e-discovery, few e-discovery processors build their own library of document filters. Instead, they turn to a handful of commercial and open source filters.
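To give a feel for what a document filter does, here is a deliberately crude sketch for one well-documented format: a Word .DOCX file is a ZIP container whose main text lives in word/document.xml, so unzipping that part and stripping the XML markup recovers the visible text. No commercial filter works this simply (they also pull styles, comments, embedded objects and metadata), and the file path is hypothetical:

```python
import re
import zipfile

def extract_docx_text(path: str) -> str:
    """A bare-bones 'document filter' for .DOCX: open the ZIP container,
    read the main document part and strip the XML markup, leaving text."""
    with zipfile.ZipFile(path) as container:
        xml = container.read("word/document.xml").decode("utf-8")
    text = re.sub(r"<[^>]+>", " ", xml)       # crudely drop every XML tag
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(extract_docx_text("sample.docx")[:200])   # hypothetical file path
```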

The leading commercial collection of document filters is Oracle’s Outside In, which its publisher describes as “a suite of software development kits (SDKs) that provides developers with a comprehensive solution to extract, normalize, scrub, convert and view the contents of 600 unstructured file formats.” Outside In quietly serves as the extraction and viewer engine behind many e-discovery review tools, a fact the sellers of those tools are often reluctant to concede; but, I suppose sellers of the Lexus ES aren’t keen to note it shares its engine, chassis and most parts with the cheaper Toyota Avalon.

Aspose Pty. Ltd., an Australian concern, licenses libraries of commercial APIs, enabling software developers to read and write to, e.g., Word documents, Excel spreadsheets, PowerPoint presentations, PDF files and multiple e-mail container formats. Aspose tools can both read from and write to the various formats, the latter considerably more challenging.

Hyland Software’s Document Filters is another developer’s toolkit that facilitates file identification and content extraction for 500+ file formats, as well as support for OCR, redaction and image rendering. Per Hyland’s website, its extraction tools power e-discovery products from Catalyst and Reveal Software.

A fourth commercial product that lies at the heart of several e-discovery and computer forensic tools (e.g., Relativity, LAW, Ringtail aka Nuix Discover and Access Data’s FTK) is dtSearch, which serves as both content extractor and indexing engine.


On the open source side, Apache’s Tika is a free toolkit for extracting text and metadata from over a thousand file types, including most encountered in e-discovery. Tika was a subproject of the open source Apache Lucene project, Lucene being an indexing and search tool at the core of several commercial e-discovery tools.

Beyond these five toolsets, the wellspring of document filters and text extractors starts to dry up, which means a broad swath of commercial e-discovery tools relies upon a tiny complement of text and metadata extraction tools to build their indices and front-end their advanced analytics.

In fact, most e-discovery tools I’ve seen in the last 15 years are proprietary wrappers around code borrowed or licensed from common sources for file identifiers, text extractors, OCR, normalizers, indexers, viewers, image generators and databases. Bolting these off-the-shelf parts together to deliver an efficient workflow and user-friendly interface is no mean feat.

But as we admire the winsome wrappers, we must remember that these products share the same DNA in spite of marketing efforts suggesting “secret sauces” and differentiation. More to the point, products built on the same text and metadata extractor share the same limitations and vulnerabilities as that extractor.

By way of example, in 2018, Cisco’s Talos Intelligence Group reported a serious security vulnerability in Hyland Software’s Document Filters. Presumably, this vulnerability impacted implementations by e-discovery platforms Catalyst and Reveal Software. This isn’t an isolated example, as security vulnerabilities have been reported in the text extraction products from, inter alia, Apache Tika, Aspose, and Oracle Outside In.

Security vulnerabilities are common across the software industry, and these reports are no more cause for alarm than any other circumstance where highly confidential and privileged client data may be compromised. But they underscore that the integrity of e-discovery products isn't simply a function of the skill of the e-discovery toolmaker and service provider. Developers build e-discovery software upon a handful of processing and indexing products that, like the ingredients in the food we eat and the origins of the products we purchase, ought to be revealed, not concealed. Why? Because all these components introduce variables contributing to different, and sometimes untoward, outcomes affecting the evidence.

Recursion and Embedded Object Extraction
Just as an essential task in processing is to correctly identify content and apply the right decoding schema, a processing tool must extract and account for all the components of a file that carry potentially responsive information.

Modern productivity files like Microsoft Office documents are rich, layered containers called Compound Files. Objects like images and the contents of other file formats may be embedded and linked within a compound file. Think of an Excel spreadsheet appearing as a diagram within a Word document. Microsoft promulgated a mechanism supporting this functionality called OLE (pronounced "o-lay" and short for Object Linking and Embedding). OLE supports dragging and dropping content between applications and the dynamic updating of embedded content, so the Excel spreadsheet embedded in a Word document updates to reflect changes in the source spreadsheet file. A processing tool must be able to recognize an OLE object and extract and enumerate all of the embedded and linked content.

A MIME e-mail is also a compound document to the extent it transmits multipart content, particularly encoded attachments. A processing tool must account for and extract every item in the e-mail's informational payload, recognizing that such content may nest like a Russian doll. An e-mail attachment could be a ZIP container holding multiple Outlook .PST mail containers holding e-mail collections that, in turn, hold attachments of OLE documents and other ZIP containers!

The mechanism by which a processing tool explores, identifies, unpacks and extracts all embedded content from a file is called recursion. It's crucial that a data extraction tool be able to recurse through a file and loop itself to extract embedded content until there is nothing else to be found; a minimal sketch of the idea follows the classifications below. Tika, the open source extraction toolset, classifies file structures in the following ways for metadata extraction:

• Simple
• Structured
• Compound
• Simple Container
• Container with Text

Simple Document
A simple document is a single document contained in a single file. Examples of simple documents include plain text files and image files. Any metadata associated with a simple document applies to the entire document.

Structured Document
Like simple documents, structured documents are single documents in single files. What makes a structured document different is that the document has internal structure, and there is metadata associated with specific parts of a document. For example, a PDF document has an internal structure for representing each page of the PDF, and there may be metadata associated with individual pages.

Compound Document
A compound document is a single document made up of many separate files, usually stored inside of a single container. Examples of compound documents include the following:


• .doc (several named streams inside of an OLE file)
• .xlsx (several named XML files inside of a ZIP file)

Simple Container
A simple container is a container file that contains other files. The container itself does not contain text but instead contains files that could be any document type. Examples of a simple container include ZIP, tar, tar.gz, and tar.bz2.

Container with Text
Some container files have text of their own and also contain other files. Examples include the following:
• an e-mail with attachments
• a .doc with embedded spreadsheets
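Here is the recursion sketch promised above. It is illustrative only, not any vendor's implementation: it unpacks nested ZIP containers and records each extracted item with a pointer back to its parent. A real processing tool applies the same looping logic across many more container and compound formats (PST, MBOX, OLE and the like):

# Minimal sketch of recursion: unpack nested ZIP containers and enumerate
# every embedded item with a pointer back to its parent (family tracking).
import io
import zipfile

def recurse(blob, source="root", inventory=None):
    """Recursively enumerate items inside (possibly nested) ZIP containers."""
    if inventory is None:
        inventory = []
    if zipfile.is_zipfile(io.BytesIO(blob)):
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            for name in zf.namelist():
                child = zf.read(name)
                inventory.append({"item": name, "parent": source, "bytes": len(child)})
                # loop until nothing else is found
                recurse(child, source=f"{source}/{name}", inventory=inventory)
    return inventory

# Usage (hypothetical file name):
# with open("collection.zip", "rb") as f:
#     for record in recurse(f.read(), source="collection.zip"):
#         print(record)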

Family Tracking and Unitization: Keeping Up with the Parts
As a processing tool unpacks the embedded components of compound and container files, it must update the database with information about what data came from what file, a relationship called unitization. In the context of e-mail, recording the relationship between a transmitting message and its attachments is called family tracking: the transmitting message is the parent object and the attachments are child objects. The processing tool must identify and preserve metadata values applicable to the entire contents of the compound or container file (like system metadata for the parent object) and embedded metadata applicable to each child object. One of the most important metadata values to preserve and pair with each object is the object's custodian or source. Post-processing, every item in an e-discovery collection must be capable of being tied back to an originating file at time of ingestion, including its prior unitization and any parent-child relationship to other objects.

Exceptions Reporting: Keeping Track of Failures
It's rare that a sizable collection of data will process flawlessly. There will almost always be encrypted files that cannot be read, corrupt files, files in unrecognized formats or languages and files requiring optical character recognition (OCR) to extract text. A great many documents are not amenable to text search without special handling. Common examples of non-searchable documents are faxes and scans, as well as TIFF images and Adobe PDF documents lacking a text layer.

A processing tool must track all exceptions and be capable of generating an exceptions report to enable counsel and others with oversight responsibility to act to rectify exceptions by, e.g., securing passwords, repairing or replacing corrupt files and running OCR against the files. Exceptions resolution is key to a defensible e-discovery process. Counsel and others processing ESI in discovery should broadly understand the exceptions handling characteristics of their processing tools and be competent to make necessary disclosures and answer questions about exceptions reporting and resolution. Exceptions are

exclusions—the evidence is missing; so, exceptions should be disclosed and must be defended. As noted earlier, it's particularly perilous when a processing tool defaults to text stripping an unrecognized or misrecognized file because the tool may fail to flag a text-stripped file as an exception requiring resolution. Just because a tool succeeds in stripping some text from a file doesn't mean that all discoverable content was extracted.

Lexical Preprocessing of Extracted Text
Computers are excruciatingly literal. Computers cannot read. Computers cannot understand language in the way humans do. Instead, computers apply rules assigned by programmers to normalize, tokenize, and segment natural language, all instances of lexical preprocessing—steps to prepare text for parsing by other tools.

Normalization
ESI is numbers; numbers are precise. Variations in those numbers—however subtle to humans—hinder a computer's ability to equate information as humans do. Before a machine can distinguish words or build an index, we must massage the streams of text spit out by the document filters to ultimately increase recall; that is, to insure that more documents are retrieved by search, even the documents we seek that don't exactly match our queries.

Variations in characters that human beings readily overlook pose big challenges to machines. So, we seek to minimize the impact of these variations through normalization. How we normalize data and even the order in which steps occur affect our ability to query the data and return correct results.

Character Normalization
Consider three characteristics of characters that demand normalization: Unicode equivalency, diacriticals (accents) and case (capitalization).

Unicode Normalization
In our discussion of ASCII encoding, we established that each ASCII character has an assigned, corresponding numeric value (e.g., a capital "E" is 0100 0101 in binary, 69 in decimal and 0x45 in hexadecimal). But linguistically identical characters encoded in Unicode may be represented by different numeric values by virtue of accented letters having both precomposed and composite references. That means that you can use an encoding specific to the accented letter (a precomposed character) or you can fashion the character as a composite by pairing the encoding for the base letter with the encoding for the diacritical. For example, the Latin capital "E" with an acute accent (É) may be encoded as either U+00C9 (a precomposed Latin capital

letter E with acute accent) or as U+0045 (Latin capital letter E) + U+0301 (combining acute accent). Both will appear as É.
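A minimal sketch using Python's standard unicodedata module shows the equivalence: the precomposed and composite forms of É differ code point by code point until both are normalized to a common form.

# Minimal sketch: precomposed vs. composite É and Unicode normalization
import unicodedata

precomposed = "\u00C9"            # É as a single code point
composite   = "E\u0301"           # E followed by a combining acute accent

print(precomposed == composite)                                   # False: different encodings
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", composite))                    # True: same composed form
print(unicodedata.normalize("NFD", precomposed) ==
      unicodedata.normalize("NFD", composite))                    # True: same decomposed form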

Easy enough to grasp, but there are a lot of accented variants of Latin letters, e.g.: Ĕ Ḝ Ȇ Ê Ê ̄ Ê ̌ Ề Ế Ể Ễ Ệ Ẻ Ḙ Ě Ɇ Ė Ė ́ Ė ̃ Ẹ Ë È È̩ Ȅ É É̩ Ē Ḕ Ḗ Ẽ Ḛ Ę Ę ́ Ę ̃ Ȩ E̩ ᶒ

Surely, we don't want to have to account for every variation in which a character can be encoded in Unicode when searching in e-discovery! To obviate that burden, the Unicode Consortium promulgates normalization algorithms that produce a consistent ("normalized") encoding for each identical character. One version of the algorithm reduces all identical characters to a composed version and another reduces all to a decomposed (composite) version. In e-discovery, we often seek to strip accents, so we see more of the latter.

Diacritical Normalization
Unicode normalization serves to ensure that the same canonical character is encoded in a consistent way. But often—especially in the United States—we want accented characters to be searchable whether a diacritical is employed or not. This requires normalizing the data to forge a false equivalency between accented characters and their non-accented ASCII counterparts. So, if you search for "resume" or "cafe," you will pick up instances of "resumé" and "café." As well, we must normalize ligatures like the German Eszett (ß) seen in the word "straße," or "street."

The major processing tools offer filters that convert alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their reasonable ASCII equivalents, if any. Lucene and Elasticsearch offer "ASCIIFolding" filters for this. dtSearch strips diacritical marks by default when it creates an index, but indices can optionally be made accent-sensitive by re-indexing.

Case Normalization
The Latin alphabet is bicameral, meaning it employs upper- and lower-case letters to enhance meaning. [The terms upper- and lower-case derive from the customary juxtaposition of the shallow drawers or "cases" that held movable type for printing presses.] By contrast, languages such as Chinese, Arabic and Hebrew are unicameral and use no capitalization. Because people capitalize indiscriminately—particularly in e-mail and messaging—most often we want search queries to be case-insensitive such that DOE, Doe and doe all return the same hits. Other times, the ability to limit a search to a case-specific query is advantageous, such as by searching just DOE when your interest is the Department of Energy and search precision is more important than recall.


Just as processing tools can be configured to "fold" Unicode characters to ASCII equivalents, they can fold all letters to their upper- or lower-case counterparts, rendering an index that is case-insensitive. Customarily, normalization of case will require specification of a default language because of different capitalization conventions attendant to diverse cultures.
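Here is a minimal sketch of both steps, accent folding and case folding, using Python's standard library. It is illustrative only and not the algorithm any particular indexer applies:

# Minimal sketch of diacritical and case normalization: decompose to NFD,
# drop the combining marks (accents), then fold case so queries like
# "resume" or "cafe" match "Resumé" and "Café" in the index.
import unicodedata

def fold(text):
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

print(fold("Resumé"))    # resume
print(fold("Café"))      # cafe
print(fold("DOE"))       # doe
print(fold("straße"))    # strasse (casefolding expands the Eszett)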

Impact of Normalization on Discovery Outcomes
Although all processing tools draw on a handful of filters and algorithms for these and other normalization processes, processors don't implement normalization in the same sequence or with identical default settings. Accordingly, it's routine to see tools produce varying outcomes in culling and search because of differences in character normalization. Whether these differences are material or not depends upon the nature of the data and the inquiry; but any service provider and case manager should know how their tool of choice normalizes data.

In the sweep of a multi-million document project, the impact of normalization might seem trivial. Yet character normalization affects the whole collection and plays an outsize role in what's filtered and found. It's an apt reminder that a good working knowledge of processing equips e-discovery professionals to "normalize" expectations, especially expectations as to what data will be seen and searchable going forward. The most advanced techniques in analytics and artificial intelligence are no better than what emerges from processing. If the processing is off, it's fancy joinery applied to rotten wood.

Lawyers must fight for quality before review. Sure, review is the part of e-discovery most lawyers see and understand, so the part many fixate on. As well, review is the costliest component of e-discovery and the one with cool tools. But here's the bottom line: The most sophisticated MRI scanner won't save those who don't survive the trip to the hospital. It's more important to have triage that gets people to the hospital alive than the best-equipped emergency room. Collection and processing are the EMTs of e-discovery. If we don't pay close attention to quality, completeness and process before review, review won't save us.


Time Zone Normalization
You needn't be an Einstein of e-discovery to appreciate that time is relative. Parsing a message thread, it's common to see e-mails from Europe to the U.S. prompt replies that, at least according to embedded metadata, appear to precede by hours the messages they answer. Time zones and daylight saving time both work to make it difficult to correctly order documents and communications on a consistent timeline. So, a common processing task is to normalize date and time values according to a single temporal baseline, often Coordinated Universal Time (UTC)—essentially Greenwich Mean Time—or any other time zone the parties choose. The differential between the source time and UTC may then be expressed as plus or minus the number of hours separating the two (e.g., UTC-0500 to denote five hours earlier than UTC).
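A minimal sketch of the normalization step, using Python's standard time zone data (the dates, times and zones are hypothetical), shows how a reply that appears to precede its prompting message on local clocks sorts correctly once both are converted to UTC:

# Minimal sketch: normalize local timestamps from different time zones to a
# single UTC baseline so messages sort on a consistent timeline.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo   # Python 3.9+

sent_paris   = datetime(2019, 11, 4, 17, 30, tzinfo=ZoneInfo("Europe/Paris"))
reply_austin = datetime(2019, 11, 4, 10, 45, tzinfo=ZoneInfo("America/Chicago"))

for label, stamp in (("sent", sent_paris), ("reply", reply_austin)):
    print(label, stamp.astimezone(timezone.utc).isoformat())
# sent  2019-11-04T16:30:00+00:00
# reply 2019-11-04T16:45:00+00:00  (the reply follows the message once both are in UTC)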

Parsing and Tokenization
To this point, we've focused on efforts to identify a file's format, then extract its content and metadata—important tasks, because if you don't get the file's content out and properly decoded, you've nearly nothing to work with. But, getting data out is just the first step. Now, we must distill the extracted content into the linguistic components that serve to convey the file's informational payload; that is, we need to isolate the words within the documents and construct an index of those words to allow us to instantly identify or exclude files based on their lexical content.

There’s a saying that anything a human being can do after age five is easy for a computer, but mastering skills humans acquire earlier is hard. Calculate pi to 31 trillion digits? Done! Read a Dr. Seuss book? Sorry, no can do.

Humans are good at spotting linguistic units like words and sentences from an early age, but computers must identify lexical units or "tokens" within extracted and normalized character strings, a process called "tokenization." When machines search collections of documents and data for keywords, they don't search the extracted text of the documents or data for matches; instead, they consult an index of words built from extracted text. Machines cannot read; instead, computers identify "words" in documents because their appearance and juxtaposition meet certain tokenization rules. These rules aren't uniform across systems or software. Many indices simply don't index short words (e.g., two-letter words, acronyms and initializations). None index single letters or numbers.

Tokenization rules also govern such things as the handling of punctuated terms (as in a compound word like “wind-driven”), capitalization/case (will a search for “roof” also find “Roof?”), diacritical marks (will a search for “Rene” also find “René?”) and numbers and single letters (will a search for “Clause 4.3” work? What about a search for “Plan B?”). Most people simply assume these searches will work. Yet, in many e-discovery search tools, they don’t work as expected or don’t work at all.

So, how do you train a computer to spot sentences and words? What makes a word a word and a sentence a sentence?

Languages based on Latin-, Cyrillic-, or Greek-based writing systems, such as English and European languages, are "segmented;" that is, they tend to set off ("delimit") words by white space and punctuation. Consequently, most tokenizers for segmented languages base token boundaries on spacing and punctuation. That seems a simple solution at first blush, but one quickly complicated by hyphenated words, contractions, dates, phone numbers and abbreviations. How does the machine distinguish a word-break hyphen from a true or lexical hyphen? How does the machine distinguish the periods in the salutation "Mr." or the initialization "G.D.P.R." from periods which signal the ends of sentences? In the realm of medicine and pharmacology, many terms contain numbers, hyphens and parentheses as integral parts. How could you defend a search for Ibuprofen if you failed to also seek instances of (RS)-2-(4-(2-methylpropyl)phenyl)propanoic acid?

Again, tokenization rules aren’t uniform across systems, software or languages. Some tools are simply incapable of indexing and searching certain characters. These exclusions impact discovery in material ways. Several years ago, after contentious motion practice, a court ordered the parties to search a dataset using queries incorporating the term “20%.” No one was pleased to learn their e-discovery tools were incapable of searching for the percentage sign.
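The point is easy to demonstrate. In this minimal sketch (illustrative only, not any product's tokenizer), the same sentence yields different tokens depending on whether punctuation is treated as a delimiter or kept as searchable text; a query for "20%" succeeds against one index and fails against the other:

# Minimal sketch: tokenization rules decide what "words" reach the index.
import re

text = "The wind-driven rain reduced Q3 margins by 20%."

as_spaces  = re.findall(r"[A-Za-z0-9]+", text.lower())       # punctuation treated as spaces
keep_chars = re.findall(r"[A-Za-z0-9%\-]+", text.lower())    # hyphen and % kept searchable

print(as_spaces)   # ['the', 'wind', 'driven', 'rain', 'reduced', 'q3', 'margins', 'by', '20']
print(keep_chars)  # ['the', 'wind-driven', 'rain', 'reduced', 'q3', 'margins', 'by', '20%']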

You cannot run a query in Relativity including the percentage sign (%) because Relativity uses dtSearch as an indexing tool and dtSearch has reserved the character “%” for another purpose. This is true no matter how you tweak the settings because the % sign simply cannot be added to the index and made searchable.8 When you run a search, you won’t be warned that the search is impossible; you’ll simply get no hits on any query requiring the % sign be found.

8 There's a clumsy workaround to this described in the Help section of Relativity's website: https://help.relativity.com/RelativityOne/Content/Recipes/Searching__Filtering__and_Sorting/Searching_for_symbols.htm; however, it doesn't serve to make a percentage value searchable so much as permit a user to find a numeric value followed by any symbol; that is, you'd hit on 75%, but also 75! and 75#. Even this kludge requires rebuilding all indices.

Using dtSearch/Relativity as another example, you can specify the way to process hyphens at the time an index is created, but you cannot change how hyphens are handled without re-indexing the collection. The default setting is to treat hyphens as spaces, but there are three alternative treatments.

From the dtSearch Help pages:

The dtSearch Engine supports four options for the treatment of hyphens when indexing documents: spaces, searchable text, ignored, and "all three."

For most applications, treating hyphens as spaces is the best option. Hyphens are translated to spaces during indexing and during searches. For example, if you index "first-class mail" and search for "first class mail", "first-class-mail", or "first-class mail", you will find the phrase correctly.

HyphenSettings values and meanings:
dtsoHyphenAsIgnore: index "first-class" as "firstclass"
dtsoHyphenAsHyphen: index "first-class" as "first-class"
dtsoHyphenAsSpace: index "first-class" as "first" and "class"
dtsoHyphenAll: index "first-class" all three ways

…

The "all three" option has one advantage over treating hyphens as spaces: it will return a document containing "first-class" in a search for "firstclass". Otherwise, it provides no benefit over treating hyphens as spaces, and it has some significant disadvantages:
1. The "all three" option generates many extra words during indexing. For each pair of words separated by a hyphen, six words are generated in the index.
2. If hyphens are treated as significant at search time, it can produce unexpected results in searches involving longer phrases or words with multiple hyphens.

By default, dtSearch and Relativity treat all the following characters as spaces: !"#$&'()*+,./:;<=>?@[\5c]^`{|}~


Although several of the characters above can be made searchable by altering the default setting and reindexing the collection, the following characters CANNOT be made searchable in dtSearch and Relativity: ( ) * ? % @ ~ & : =

Stop Words
Some common "stop words" or "noise words" are excluded from an index when it's compiled. E-discovery tools typically exclude dozens or hundreds of stop words from indices. The table below lists 123 English stop words excluded by default in dtSearch and Relativity:

Source: Relativity website, November 3, 2019

Relativity won't index punctuation marks, single letters or numbers. Nuix Discovery (formerly Ringtail) uses a similar English stop word list, except Nuix indexes the words "between," "does," "else," "from," "his," "make," "no," "so," "to," "use" and "want" where Relativity won't. Relativity indexes the words "an," "even," "further," "furthermore," "hi," "however," "indeed," "made," "moreover," "not," "or," "over," "she" and "thus" where Nuix won't. Does it make sense that both tools exclude "he" and "her," and both include "hers," but only Relativity excludes "his?"


Tools built on the open source Natural Language Toolkit won't index 179 English stop words. In other products, I've seen between 500 and 700 English stop words excluded. In one notorious instance, the two words that made up the company's own name were both stop words in their e-discovery system. They literally could not find their own name (or other stop words in queries they'd agreed to run)!

Remember, if it’s not indexed, it’s not searched. Putting a query in quotes won’t make any difference. No warning messages appear when you run a query including stop words, so it’s the obligation of those running searches to acknowledge the incapability of the search. “No hits” is not the same thing as “no documents.” If a party or counsel knows that the systems or searches used in e-discovery will fail to perform as expected, they should affirmatively disclose such shortcomings. If a party or counsel is uncertain whether systems or searches work as expected, they should find out by, e.g., running tests to be reasonably certain.

No system is perfect, and perfect isn’t the e-discovery standard. Often, we must adapt to the limitations of systems or software. But we must know what a system can’t do before we can find ways to work around its limitations or set expectations consistent with actual capabilities, not magical thinking and unfounded expectations.

Building a Database and Concordance Index
This primer is about processing, and the database (and viewer) belong to the realm of review tools and their features. However, a brief consideration of their interrelationship is useful.

The Database
At every step of processing, information derived from and about the items processed is continually handed off to a database. As the system ingests each file, a record of its name, size and system metadata values becomes part of the database. Sources—called "custodians" when they are individuals—are identified and comprise database fields. The processor calculates hash values and contributes them to the database. The tool identifies and extracts application metadata values from the processed information items, including, inter alia, authoring data for documents and subject, sender and recipient data for e-mail messages. The database also holds pointers to TIFF or PDF page images and to extracted text. The database is where items are enumerated, that is, assigned an item number that will uniquely identify each item in the processed collection. This is an identifier distinct from any Bates numbers subsequently assigned to items when produced.

The database lies at the heart of all e-discovery review tools. It’s the recipient of much of the information derived from processing. But note, the database is not the index of text extracted

from the processed items. The concordance index, the database and a third component, the document viewer, operate in so tightly coupled a manner that they seem like one. A query of the index customarily triggers a return of information from the database about items "hit" by the query, and the contents of those items are, in turn, depicted in the viewer, often with hits highlighted. The perceived quality of commercial e-discovery review tools is a function of how seamlessly and efficiently these discrete functional components integrate to form a robust and intuitive user interface and experience.

Much like the file identification and content extraction tools discussed, e-discovery tool developers tend not to code databases from scratch but build atop a handful of open source or commercial database platforms. Notable examples are SQL Server and SQLite. Notwithstanding Herculean efforts of marketers to suggest differences, e-discovery tools tend to share the same or similar "database DNA." Users are none the wiser to the common foundations because the "back end" of discovery tools (the program's code and database operations layer) tends to be hidden from users and wrapped in an attractive interface.

The Concordance Index
Though there was a venerable commercial discovery tool called Concordance (now, Cloudnine), the small-c term "concordance" describes an alphabetical listing, particularly a mapping, of the important words in a text. Historically, scholars spent years painstakingly constructing concordances (or "full-text" indices) of Shakespeare's works or the Bible by hand. In e-discovery, software builds concordance indices to speed lexical search.

While it's technically feasible to keyword search all documents in a collection, one after another (so-called "serial search"), it's terribly inefficient.9 Instead, the universal practice in e-discovery is to employ software to extract the text from information items, tokenize the contents to identify words and then construct a list of each token's associated document and location. Accordingly, text searches in e-discovery don't search the evidence; they only search a concordance index of tokenized text. This is a crucial distinction because it means the quality of search in e-discovery is only as effective as the index is complete. Indexing describes the process by which the data being indexed is processed to form a highly efficient cross-reference lookup to facilitate rapid searching.

Notable examples of concordance indexing tools are dtSearch, MarkLogic, Hyland Enterprise Search and Apache Lucene, along with the Lucene-related products Apache SOLR and Elasticsearch.

9 Computer forensic examiners still use serial searches when the corpus is modest and when employing Global Regular Expressions (GREP searches) to identify patterns conforming to, e.g., social security or credit card numbers.

Lucene is an open source, inverted, full-text index. It takes the tokenized and normalized word data from the processing engine and constructs an index for each token (word). It's termed an "inverted index" because it inverts a page-centric data structure (page>words) to a keyword-centric data structure (words>pages). The full-text index consists of a token identifier (the word), a listing of documents containing the token and the position of the token within those documents (offset start- and end-positions, plus the increment between tokens). Querying the index returns a listing of tokens and positions for each. It may also supply an algorithmic ranking of the results. Multiword queries return a listing of documents where there's an intersection of the returns for each token searched and found. An advantage of tools like Lucene is their ability to be updated incrementally (i.e., new documents can be added without the need to recreate a single, large index).
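A minimal sketch (illustrative only) of an inverted index shows the core idea: tokenize each document, drop stop words, and record for every token the documents and positions where it appears. Queries consult this structure, never the documents themselves:

# Minimal sketch of an inverted (concordance) index: token -> (document, position) postings.
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "to", "and"}    # real tools exclude dozens or hundreds

docs = {
    "DOC-001": "The terms of the merger were not disclosed.",
    "DOC-002": "Disclosed terms appear in Plan B.",
}

index = defaultdict(list)                        # token -> [(doc id, position), ...]
for doc_id, text in docs.items():
    for pos, token in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        if token not in STOP_WORDS:
            index[token].append((doc_id, pos))

print(index["terms"])      # [('DOC-001', 1), ('DOC-002', 1)]
print(index.get("b"))      # [('DOC-002', 5)], a single letter many real indices would drop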

Culling and Selecting the Dataset
Processing is not an end but a means by which potentially responsive information is exposed, enumerated, normalized and passed on for search, review and production. Although much culling and selection occurs in the search and review phase, the processing phase is an opportunity to reduce data volumes by culling and selecting by defensible criteria.

Now that the metadata is in a database and the collection has been made text searchable by creation of a concordance index, it's feasible to filter the collection by, e.g., date ranges, file types, Internet domains, file size, custodian and other objective characteristics. We can also cull the dataset by immaterial item suppression, de-NISTing and deduplication, all discussed infra. The crudest but most common culling method is keyword and query filtering; that is, lexical search. Lexical search and its shortcomings are beyond the scope of this processing primer, but it should be clear by now that the quality of the processing bears materially on the ability to find what we seek through lexical search. No search and review process can assess the content of items missed or malformed in processing.

Immaterial Item Suppression
E-discovery processing tools must be able to track back to the originating source file for any information extracted and indexed. Every item must be catalogued and enumerated, including each container file and all contents of each container. Still, in e-discovery, we've little need to search or produce the wrappers if we've properly expanded and extracted the contents. The wrappers are immaterial items.

Immaterial items are those extracted for forensic completeness but having little or no intrinsic value as discoverable evidence. Common examples of immaterial items include the folder structure in which files are stored and the various container files (like ZIP, RAR files and other containers, e.g., mailbox files like Outlook PST and MBOX, and forensic disk image wrapper files like .E0x or .AFF) that tend to have no relevance apart from their contents.

Accordingly, it’s handy to be able to suppress immaterial items once we extract and enumerate their contents. It’s pointless to produce a ZIP container if its contents are produced, and it’s perilous to do so if some contents are non-responsive or privileged.

De-NISTing
De-NISTing is a technique used in e-discovery and computer forensics to reduce the number of files requiring review by excluding standard components of the computer's operating system and off-the-shelf software applications like Word, Excel and other parts of Microsoft Office. Everyone has this digital detritus on their systems—things like Windows screen saver images, document templates, clip art, system sound files and so forth. It's the stuff that comes straight off the installation disks, and it's just noise to a document review.

Eliminating this noise is called "de-NISTing" because those noise files are identified by matching their cryptographic hash values (i.e., digital fingerprints, explanation to follow) to a huge list of software hash values maintained and published by the National Software Reference Library, a branch of the National Institute of Standards and Technology (NIST). The NIST list is free to download, and pretty much everyone who processes data for e-discovery and computer forensic examination uses it.

The value of de-NISTing varies according to the makeup of the collection. It’s very effective when ESI has been collected indiscriminately or by forensic imaging of entire hard drives (including operating system and executable files). De-NISTing is of limited value when the collection is composed primarily of user-created files and messages as distinguished from system files and executable applications. As a rule, the better focused the e-discovery collection effort (i.e., the more targeted the collection), the smaller the volume of data culled via de-NISTing.
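In outline, de-NISTing is little more than a set lookup against known-file hashes. The sketch below is illustrative only; the hash-list file name, its one-hash-per-line format and the collection path are simplifying assumptions, not the NSRL's actual distribution format:

# Minimal sketch of de-NISTing: hash each collected file and suppress any file
# whose hash appears in a known-file list such as the NIST NSRL hash set.
import hashlib
from pathlib import Path

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest().upper()

# Hypothetical: one known-file hash per line, loaded into a set for fast lookup.
known_hashes = set(Path("nsrl_sha1.txt").read_text().split())

survivors = [p for p in Path("collection").rglob("*")
             if p.is_file() and sha1_of(p) not in known_hashes]
print(len(survivors), "files remain after de-NISTing")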

Cryptographic Hashing
My students at the University of Texas School of Law and the Georgetown E-Discovery Training Academy spend considerable time learning that all ESI is just a bunch of numbers. They muddle through readings and exercises about Base2 (binary), Base10 (decimal), Base16 (hexadecimal) and Base64 and also about the difference between single-byte encoding schemes (like ASCII) and double-byte encoding schemes (like Unicode). It may seem like a wonky walk in the weeds; but it's time well spent when the students snap to the crucial connection between numeric

encoding and our ability to use math to cull, filter and cluster data. It's a necessary precursor to gaining Proustian "new eyes" for ESI.

Because ESI is just a bunch of numbers, we can use algorithms (mathematical formulas) to distill and compare those numbers. Every student of electronic discovery learns about cryptographic hash functions and their usefulness as tools to digitally fingerprint files in support of identification, authentication, exclusion and deduplication. When I teach law students about hashing, I tell them that hash functions are published, standard mathematical algorithms into which we input digital data of arbitrary size and the hash algorithm spits out a bit string (again, just a sequence of numbers) of fixed length called a “hash value.” Hash values almost exclusively correspond to the digital data fed into the algorithm (termed “the message”) such that the chance of two different messages sharing the same hash value (called a “hash collision”) is exceptionally remote. But because it’s possible, we can’t say each hash value is truly “unique.”

Using hash algorithms, any volume of data—from the tiniest file to the contents of entire hard drives and beyond—can be almost uniquely expressed as an alphanumeric sequence. In the case of the MD5 hash function, data is distilled to a value written as 32 hexadecimal characters (0-9 and A-F). It's hard to understand until you've figured out Base16; but, those 32 characters represent 340 trillion, trillion, trillion different possible values (2^128 or 16^32).

Hash functions are one-way calculations, meaning you can’t reverse (“invert”) a hash value and ascertain the data corresponding to the hash value in the same way that you can’t decode a human fingerprint to deduce an individual’s eye color or IQ. It identifies, but it doesn’t reveal. Another key feature of hashing is that, due to the so-called “avalanche effect” characteristic of a well-constructed cryptographic algorithm, when the data input changes even slightly, the hash value changes dramatically, meaning there’s no discernable relationship between inputs and outputs. Similarity between hash values doesn’t signal any similarity in the data hashed.

There are lots of different hash algorithms, and different hash algorithms generate different hash values for the same data. For example, the phrase "Mary had a little lamb" yields the following values under three common hash algorithms:

MD5: e946adb45d4299def2071880d30136d4

SHA-1: bac9388d0498fb378e528d35abd05792291af182

SHA-256: efe473564cb63a7bf025dd691ef0ae0ac906c03ab408375b9094e326c2ad9a76

It’s identical data, but it prompts different hashes using different algorithms. Conversely, identical data will generate identical hash values when using the same hash function. Freely published hash functions are available to all, so if two people (or machines) anywhere use the same hash function against data and generate matching hash values, their data is identical. If

they get different hash values, they can be confident the data is different. The differences may be trivial in practical terms, but any difference suffices to produce markedly different hash values.
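The behavior is easy to verify with Python's standard hashlib module. This minimal sketch hashes the same phrase under three algorithms, confirms that identical inputs produce identical digests under the same function, and shows the avalanche effect when a single character changes:

# Minimal sketch: different algorithms, identical inputs, and the avalanche effect.
import hashlib

message = b"Mary had a little lamb"

for name in ("md5", "sha1", "sha256"):
    print(name, hashlib.new(name, message).hexdigest())

print(hashlib.md5(message).hexdigest() ==
      hashlib.md5(b"Mary had a little lamb").hexdigest())   # True: identical data, identical hash
print(hashlib.md5(b"Mary had a little lamp").hexdigest())   # one letter changed, a wholly different digest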

I don’t dare go deeper in the classroom, but let’s dig down a bit here and explore the operations behind calculating an MD5 hash value. If you really don’t care, just skip ahead to deduplication.

A widely used hash function is the Message Digest 5 (MD5) hash algorithm circulated in 1992 by MIT professor Ron Rivest as Request for Comments 1321. Requests for Comments or RFCs are a way the technology community circulates proposed standards and innovations generally relating to the Internet. MD5 has been compromised in terms of its immunity to hash collisions in that it's feasible to generate different inputs that generate matching MD5 hashes; however, MD5's flaws minimally impact its use in e-discovery, where it remains a practical and efficient way to identify, deduplicate and cull datasets.

When I earlier spoke of a hash algorithm generating a hash value of "fixed length," that fixed length for MD5 hashes is 128 bits (16 bytes) or 128 ones and zeroes in binary or Base2 notation. That's a vast number space: values ranging from zero to 340,282,366,920,938,463,463,374,607,431,768,211,455 in our familiar decimal or Base10 notation. It's also unwieldy, so we shorten MD5 hash values to a 32-character Base16 or hexadecimal ("hex") notation. It's the same numeric value conveniently expressed in a different base or "radix," so it requires only one-fourth as many characters to write the number in hex notation as in binary notation.

That 32-character MD5 hash value is built from four 32-bit calculated values that are concatenated, that is, positioned end to end to form a 128-bit sequence or "string." Since we can write a 32-bit number as eight hexadecimal characters, we can write a 128-bit number as four concatenated 8-character hex values forming a single 32-character hexadecimal hash. Each of those four 32-bit values is the product of 64 calculations (four rounds of 16 operations) performed on each 512-bit chunk of the data being hashed while applying various specified constants. After each round of calculations, the data shifts within the array in a practice called left bit rotation, and a new round of calculations begins. The entire hashing process starts by padding the message data to insure it neatly comprises 512-bit chunks and by initializing the four 32-bit variables to four default values.

For those wanting to see how this is done programmatically, here’s the notated MD5 pseudocode (from Wikipedia):

//Note: All variables are unsigned 32 bit and wrap modulo 2^32 when calculating
var int[64] s, K
var int i


//s specifies the per-round shift amounts
s[ 0..15] := { 7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22 }
s[16..31] := { 5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20 }
s[32..47] := { 4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23 }
s[48..63] := { 6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21 }

//Use binary integer part of the sines of integers (Radians) as constants:
for i from 0 to 63
    K[i] := floor(2^32 × abs(sin(i + 1)))
end for
//(Or just use the following precomputed table):
K[ 0.. 3] := { 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee }
K[ 4.. 7] := { 0xf57c0faf, 0x4787c62a, 0xa8304613, 0xfd469501 }
K[ 8..11] := { 0x698098d8, 0x8b44f7af, 0xffff5bb1, 0x895cd7be }
K[12..15] := { 0x6b901122, 0xfd987193, 0xa679438e, 0x49b40821 }
K[16..19] := { 0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa }
K[20..23] := { 0xd62f105d, 0x02441453, 0xd8a1e681, 0xe7d3fbc8 }
K[24..27] := { 0x21e1cde6, 0xc33707d6, 0xf4d50d87, 0x455a14ed }
K[28..31] := { 0xa9e3e905, 0xfcefa3f8, 0x676f02d9, 0x8d2a4c8a }
K[32..35] := { 0xfffa3942, 0x8771f681, 0x6d9d6122, 0xfde5380c }
K[36..39] := { 0xa4beea44, 0x4bdecfa9, 0xf6bb4b60, 0xbebfbc70 }
K[40..43] := { 0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x04881d05 }
K[44..47] := { 0xd9d4d039, 0xe6db99e5, 0x1fa27cf8, 0xc4ac5665 }
K[48..51] := { 0xf4292244, 0x432aff97, 0xab9423a7, 0xfc93a039 }
K[52..55] := { 0x655b59c3, 0x8f0ccc92, 0xffeff47d, 0x85845dd1 }
K[56..59] := { 0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1 }
K[60..63] := { 0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391 }

//Initialize variables:
var int a0 := 0x67452301   //A
var int b0 := 0xefcdab89   //B
var int c0 := 0x98badcfe   //C
var int d0 := 0x10325476   //D

//Pre-processing: adding a single 1 bit
append "1" bit to message
// Notice: the input bytes are considered as bits strings,
// where the first bit is the most significant bit of the byte.


//Pre-processing: padding with zeros
append "0" bit until message length in bits ≡ 448 (mod 512)
append original length in bits mod 2^64 to message

//Process the message in successive 512-bit chunks:
for each 512-bit chunk of padded message
    break chunk into sixteen 32-bit words M[j], 0 ≤ j ≤ 15
    //Initialize hash value for this chunk:
    var int A := a0
    var int B := b0
    var int C := c0
    var int D := d0
    //Main loop:
    for i from 0 to 63
        var int F, g
        if 0 ≤ i ≤ 15 then
            F := (B and C) or ((not B) and D)
            g := i
        else if 16 ≤ i ≤ 31 then
            F := (D and B) or ((not D) and C)
            g := (5×i + 1) mod 16
        else if 32 ≤ i ≤ 47 then
            F := B xor C xor D
            g := (3×i + 5) mod 16
        else if 48 ≤ i ≤ 63 then
            F := C xor (B or (not D))
            g := (7×i) mod 16
        //Be wary of the below definitions of a,b,c,d
        F := F + A + K[i] + M[g]
        A := D
        D := C
        C := B
        B := B + leftrotate(F, s[i])
    end for
    //Add this chunk's hash to result so far:
    a0 := a0 + A
    b0 := b0 + B


    c0 := c0 + C
    d0 := d0 + D
end for

var char digest[16] := a0 append b0 append c0 append d0 //(Output is in little-endian)

//leftrotate function definition
leftrotate (x, c)
    return (x << c) binary or (x >> (32-c));

Despite the complexity of these calculations, it's possible to contrive hash collisions where different data generate matching MD5 hash values. Accordingly, the cybersecurity community has moved away from MD5 in applications requiring collision resistance, such as digital signatures.

You may wonder why MD5 remains in wide use if it’s “broken” by engineered hash collisions. Why not simply turn to more secure algorithms like SHA-256? Some tools and vendors have done so, but a justification for MD5’s survival is that the additional calculations required to make alternate hash functions more secure consume more time and computing resources. Too, most tasks in e-discovery built around hashing—e.g., deduplication and De-NISTing—don’t demand strict protection from engineered hash collisions. For e-discovery, MD5 “ain’t broke,” so there’s little cause to fix it.

Deduplication
Processing information items to calculate hash values supports several capabilities, but probably none more useful than deduplication.

Near-Deduplication
A modern hard drive holds trillions of bytes, and even a single Outlook e-mail container file typically comprises billions of bytes. Accordingly, it's easier and faster to compare 32-character/16-byte "fingerprints" of voluminous data than to compare the data itself, particularly as the comparisons must be made repeatedly when information is collected and processed in e-discovery. In practice, each file ingested and item extracted is hashed, and its hash value is compared to the hash values of items previously ingested and extracted to determine if the file or item has been seen before. The first file is sometimes called the "pivot file," and subsequent files with matching hashes are suppressed as duplicates; the instances of each duplicate and certain metadata are typically noted in a deduplication or "occurrence" log.
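A minimal sketch of that workflow (illustrative only; the collection folder name is hypothetical) hashes each file, treats the first instance of each hash as the pivot and logs later instances as suppressed duplicates:

# Minimal sketch of hash deduplication with a pivot file and an occurrence log.
import hashlib
from pathlib import Path

pivots = {}              # hash -> pivot file path
occurrence_log = []      # (hash, duplicate path, pivot path)

for path in sorted(Path("collection").rglob("*")):   # hypothetical collection folder
    if not path.is_file():
        continue
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    if digest in pivots:
        occurrence_log.append((digest, str(path), str(pivots[digest])))
    else:
        pivots[digest] = path

print(len(pivots), "unique items;", len(occurrence_log), "duplicates suppressed")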


When the data is comprised of loose files and attachments, a hash algorithm tends to be applied to the full contents of the files. Notice that I said "contents." Some data we associate with files is not actually stored inside the file but must be gathered from the file system of the device storing the data. Such "system metadata" is not contained within the file and, thus, is not included in the calculation when the file's content is hashed. A file's name is perhaps the best example of this. Recall that even slight differences in files cause them to generate different hash values. But, since a file's name is not typically housed within the file, you can change a file's name without altering its hash value.

So, the ability of hash algorithms to deduplicate depends upon whether the numeric values that serve as building blocks for the data differ from file to file. Keep that firmly in mind as we consider the many forms in which the informational payload of a document may manifest.

A Word .DOCX document is constructed of a mix of text and rich media encoded in Extensible Markup Language (XML), then compressed using the ubiquitous ZIP compression algorithm. It’s a file designed to be read by Microsoft Word.

When you print the “same” Word document to an Adobe PDF format, it’s reconstructed in a page description language specifically designed to work with Adobe Acrobat. It’s structured, encoded and compressed in an entirely different way than the Word file and, as a different format, carries a different binary header signature, too.

When you take the printed version of the document and scan it to a Tagged Image File Format (TIFF), you’ve taken a picture of the document, now constructed in still another different format—one designed for TIFF viewer applications.

To the uninitiated, they are all the “same” document and might look pretty much the same printed to paper; but as ESI, their structures and encoding schemes are radically different. Moreover, even files generated in the same format may not be digitally identical when made at separate times. For example, no two optical scans of a document will produce identical hash values because there will always be some variation in the data acquired from scan to scan. Slight differences perhaps; but, any difference at all in content is going to frustrate the ability to generate matching hash values.

Opinions are cheap. Testing is truth. To illustrate this, I created a Word document of the text of Lincoln’s Gettysburg Address. First, I saved it in the latest .DOCX Word format. Then, I saved a copy in the older .DOC format. Next, I saved the Word document to a .PDF format, using both the Save as PDF and Print to PDF methods. Finally, I printed and scanned the document to TIFF and PDF. Without shifting the document on the scanner, I scanned it several times at matching and differing resolutions.


I then hashed all the iterations of the “same” document. As the table below demonstrates, none of them matched hash-wise, not even the successive scans of the paper document:

Thus, file hash matching—the simplest and most defensible approach to deduplication—won’t serve to deduplicate the “same” document when it takes different forms or is made optically at separate times.

Now, here's where it can get confusing. If you copied any of the electronic files listed above, the duplicate files would hash match the source originals and would handily deduplicate by hash. Consequently, multiple copies of the same electronic files will deduplicate, but that is because the files being compared have the same digital content. But we must be careful to distinguish the identicality seen in multiple iterations of the same file from the pronounced differences seen when we generate different electronic versions at different times from the same content. One notable exception seen in my testing was that successively saving the same Word document to a PDF format in the same manner sometimes generated identical PDF files. It didn't occur consistently (i.e., if enough time passed, changes in metadata in the source document triggered differences prompting the calculation of different hash values); but it happened, so it's worth mentioning.

Consider hash matching in this real-life scenario: The source data were Outlook .PSTs from various custodians, each under 2GB in size. The form of production was single messages as .MSGs. Reportedly, the new review platform (actually a rather old concept search tool) was incapable of accepting an overlay load file that could simply tag the items already produced, so the messages already produced would have to be culled from the .PSTs before they were loaded. Screwy, to be sure; but we take our cases as they come, right?

It’s important to know that a somewhat obscure quirk of the .MSG message format is that when the same Outlook message is exported as an .MSG at various times, each exported message generates a different hash value because of embedded time of creation values. The differing

hash values make it impossible to use hashes of .MSGs for deduplication without processing (i.e., normalizing) the data to a format better suited to the task.

Here, a quick primer on deduplication of e-mail might be useful.

Mechanized deduplication of e-mail data can be grounded on three basic approaches:

1. Hashing the entire message as a file (i.e., a defined block of data) containing the e-mail messages and comparing the resulting hash value for each individual message file. If they match, the files hold the same data. This tends not to work for e-mail messages exported as files because, when an e-mail message is stored as a file, messages that we regard as identical in common parlance (such as identical message bodies sent to multiple recipients) are not identical in terms of their byte content. The differences tend to reflect either variations in transmission seen in the message header data (the messages having traversed different paths to reach different recipients) or variations in time (the same message containing embedded time data when exported to single message storage formats as discussed above with respect to the .MSG format).

2. Hashing segments of the message using the same hash algorithm and comparing the hash values for each corresponding segment to determine relative identicality. With this approach, a hash value is calculated for the various parts of a message (e.g., Subject, To, From, CC, Message Body, and Attachments) and these values are compared to the hash values calculated against corresponding parts of other messages to determine if they match. This method requires exclusion of those parts of a message that are certain to differ (such as portions of message headers containing server paths and unique message IDs) and normalization of segments, so that contents of those segments are presented to the hash algorithm in a consistent way. A sketch of this segment-hashing approach appears below.

3. Textual comparison of segments of the message to determine if certain segments of the message match to such an extent that the messages may be deemed sufficiently "identical" to allow them to be treated as the same for purposes of review and exclusion. This is much the same approach as (2) above, but without the use of hashing to compare the segments.

Arguably, a fourth approach entails a mix of these methods.
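By way of illustration of the second approach, the sketch below hashes normalized message segments. The field names and normalization choices are assumptions for demonstration, not any vendor's published algorithm; the Microsoft and Relativity excerpts later in this section describe two real implementations:

# Minimal sketch of segment hashing: hash normalized parts of a message
# (subject, sender, recipients, body) rather than the message file itself.
import hashlib

def segment_hashes(msg):
    def h(value):
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    recipients = ";".join(sorted(addr.lower() for addr in msg["to"]))   # order-insensitive
    body = "".join(msg["body"].split())                                 # drop spaces, tabs, line breaks
    return {
        "subject": h(msg["subject"].strip().lower()),
        "sender": h(msg["from"].lower()),
        "recipients": h(recipients),
        "body": h(body),
    }

a = {"subject": "RE: Q3", "from": "Sender@Example.com",
     "to": ["alice@example.com", "bob@example.com"], "body": "See attached.\r\n"}
b = {"subject": "RE: Q3", "from": "sender@example.com",
     "to": ["bob@example.com", "alice@example.com"], "body": "See attached."}

print(segment_hashes(a) == segment_hashes(b))   # True: the segments normalize to the same values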

All these approaches can be frustrated by working from differing forms of the “same” data because, from the standpoint of the tools which compare the information, the forms are significantly different. Thus, if a message has been “printed” to a TIFF image, the bytes that make up the TIFF image bear no digital resemblance to the bytes comprising the corresponding e-mail message, any more than a photo of a rose smells or feels like the rose.


In short, changing forms of ESI changes data, and changing data changes hash values. Deduplication by hashing requires the same source data and the same, consistent application of algorithms. This is easy and inexpensive to accomplish, but it requires a compatible workflow to insure that evidence is not altered in processing so as to prevent the application of simple and inexpensive mechanized deduplication.

When parties cannot deduplicate e-mail, the reasons will likely be one or more of the following:

1. They are working from different forms of the ESI;
2. They are failing to consistently exclude inherently non-identical data (like message headers and IDs) from the hash calculation;
3. They are not properly normalizing the message data (such as by ordering all addresses alphabetically without aliases);
4. They are using different hash algorithms;
5. They are not preserving the hash values throughout the process; or
6. They are changing the data.

The excerpts that follow are from Microsoft and Relativity support publications; each describes the methodology employed to deduplicate e-mail messages:

Office 365 Deduplication
Office 365 eDiscovery tools use a combination of the following e-mail properties to determine whether a message is a duplicate:

• InternetMessageId - This property specifies the Internet message identifier of an e-mail message, which is a globally unique identifier that refers to a specific version of a specific message. This ID is generated by the sender's e-mail client program or host e-mail system that sends the message. If a person sends a message to more than one recipient, the Internet message ID will be the same for each instance of the message. Subsequent revisions to the original message will receive a different message identifier.

• ConversationTopic - This property specifies the subject of the conversation thread of a message. The value of the ConversationTopic property is the string that describes the overall topic of the conversation. A conversation consists of an initial message and all messages sent in reply to the initial message. Messages within the same conversation have the same value for the ConversationTopic property. The value of this property is typically the Subject line from the initial message that spawned the conversation.


• BodyTagInfo - This is an internal Exchange store property. The value of this property is calculated by checking various attributes in the body of the message. This property is used to identify differences in the body of messages.

https://docs.microsoft.com/en-us/microsoft-365/compliance/de-duplication-in-ediscovery-search-results

Relativity E-Mail Deduplication
The Relativity Processing Console (RPC) generates four different hashes for e-mails and keeps each hash value separate, which allows users to de-duplicate in the RPC based on individual hashes and not an all-inclusive hash string. For example, if you're using the RPC, you have the ability to de-duplicate one custodian's files against those of another custodian based only on the body hash and not the attachment or recipient hashes.

Note that the following information is relevant to the RPC only and not to web processing in Relativity:
▪ Body hash - takes the text of the body of the e-mail and generates a hash.
▪ Header hash - takes the message time, subject, author's name and e-mail, and generates a hash.
▪ Recipient hash - takes the recipient's name and e-mails and generates a hash.
▪ Attachment hash - takes each SHA256 hash of each attachment and hashes the SHA256 hashes together.

MessageBodyHash
To calculate an e-mail's MessageBodyHash, the RPC:
1. Removes all carriage returns, line feeds, spaces, and tabs from the body of the e-mail to account for formatting variations. An example of this is when Outlook changes the formatting of an e-mail and displays a message stating, "Extra Line breaks in this message were removed."
2. Captures the PR_BODY tag from the MSG (if it's present) and converts it into a Unicode string.
3. Gets the native body from the PR_RTF_COMPRESSED tag (if the PR_BODY tag isn't present) and either converts the HTML or the RTF to a Unicode string.
4. Constructs a SHA256 hash from the Unicode string derived in step 2 or 3 above.

HeaderHash
To calculate an e-mail's HeaderHash, the RPC:
1. Constructs a Unicode string containing SubjectSenderNameSenderEMailClientSubmitTime.

2. Derives the SHA256 hash from the header string. The ClientSubmitTime is formatted with the following: m/d/yyyy hh:mm:ss AM/PM. The following is an example of a constructed string: RE: Your last email Robert Simpson [email protected] 10/4/2010 05:42:01 PM

RecipientHash
The RPC calculates an e-mail's RecipientHash through the following steps:
1. Constructs a Unicode string by looping through each recipient in the e-mail and inserting each recipient into the string. Note that BCC is included in the Recipients element of the hash.
2. Derives the SHA256 hash from the recipient string RecipientNameRecipientEMail. The following is an example of a constructed recipient string of two recipients: Russell Scarcella [email protected] Kristen Vercellino [email protected]

AttachmentHash
To calculate an e-mail's AttachmentHash, the RPC:
1. Derive a SHA256 hash for each attachment.
   o If the attachment is a loose file (not an e-mail) the normal standard SHA256 file hash is computed for the attachment.
   o If the attachment is an e-mail, we use the e-mail hashing algorithm described in section 1.2 to generate all four de-dupe hashes. Then, these hashes are combined using the algorithm described in section 1.3 to generate a singular SHA256 attachment hash.
2. Encodes the hash in a Unicode string as a string of hexadecimal numbers without separators.
3. Construct a SHA256 hash from the bytes of the composed string in Unicode format. The following is an example of constructed string of two attachments: 80D03318867DB05E40E20CE10B7C8F511B1D0B9F336EF2C787CC3D51B9E26BC9974C9D2C0EEC0F515C770B8282C87C1E8F957FAF34654504520A7ADC2E0E23EA

Calculating the Relativity deduplication hash
To derive the Relativity deduplication hash, the system:


1. Constructs a string that includes the hashes of all four e-mail components described above (the RPC deduplication hashes for e-mails).
2. Converts that string to a byte array of UTF8 values.
3. Feeds the bytes into a standard MD5/SHA1/SHA256 subroutine, which then computes the hash of the UTF8 byte array.
Note: If two e-mails or loose files have an identical body, attachment, recipient, and header hash, they are duplicates.
https://help.relativity.com/9.2/Content/Relativity/Processing/deduplication_considerations.htm#Technica
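To make these mechanics concrete, the following Python sketch mimics the general approach described above: normalize the body, build the header, recipient and attachment strings, hash each component with SHA-256 and hash the concatenated component hashes into a composite deduplication value. It is an illustration only, not Relativity's code; a real implementation must extract MAPI properties (e.g., PR_BODY), honor the documented field order and formatting, and may compute the final value with MD5 or SHA-1 instead of SHA-256.

import hashlib

def sha256_hex(text: str) -> str:
    # Hash the UTF-8 bytes of a string and return the hex digest.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def body_hash(body_text: str) -> str:
    # Strip carriage returns, line feeds, spaces and tabs so that
    # formatting variations don't defeat deduplication.
    normalized = "".join(ch for ch in body_text if ch not in "\r\n\t ")
    return sha256_hex(normalized)

def header_hash(subject, sender_name, sender_email, submit_time) -> str:
    # Concatenate the header fields into one string, then hash it.
    return sha256_hex(f"{subject}{sender_name}{sender_email}{submit_time}")

def recipient_hash(recipients) -> str:
    # recipients is a list of (name, email) tuples, BCC included.
    return sha256_hex("".join(name + email for name, email in recipients))

def attachment_hash(attachment_hashes) -> str:
    # Concatenate each attachment's hex hash, then hash the combined string.
    return sha256_hex("".join(attachment_hashes))

def dedupe_hash(body, header, recips, attach) -> str:
    # Combine the four component hashes and hash the result.
    return sha256_hex(body + header + recips + attach)

# Hypothetical example message (placeholder values):
b = body_hash("Please see the attached draft.\r\n")
h = header_hash("RE: Your last email", "Robert Simpson",
                "[email protected]", "10/4/2010 05:42:01 PM")
r = recipient_hash([("Russell Scarcella", "[email protected]")])
a = attachment_hash([hashlib.sha256(b"attachment bytes").hexdigest()])
print(dedupe_hash(b, h, r, a))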

Other Processing Tasks
This primer addresses the core functions of ESI processing in e-discovery, but there are many other processing tasks that make up today's powerful processing tools. Some examples include:

Foreign Language Detection
Several commercial and open source processing tools support the ability to recognize and identify foreign language content, enabling selection of the right filters for text extraction, character set selection and diacritical management. Language detection also facilitates assigning content to native speakers for review.

Entropy Testing
Entropy testing is a statistical method by which to identify encrypted files and flag them for special handling (a simple illustration of the idea follows this group of tasks).

Decryption
Some processing tools support use of customizable password lists to automate decryption of password-protected items when credentials are known.

Bad Extension Flagging
Most processing tools warn of a mismatch between a file's binary signature and its extension, potentially useful to resolve exceptions and detect data hiding.

Color Detection
When color conveys information, it's useful to detect such usage and direct color-enhanced items to production formats other than grayscale TIFF imaging.

Hidden Content Flagging
It's common for evidence, especially Microsoft Office content, to incorporate relevant content (like collaborative comments in Word documents and PowerPoint speaker notes) that won't appear in the production set. Flagging such items for special handling is a useful way to avoid missing that discoverable (and potentially privileged) content.
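As an illustration of the entropy testing concept, the brief Python sketch below computes Shannon entropy over a sample of a file's bytes; values approaching 8 bits per byte suggest encrypted (or already compressed) content deserving special handling. The 7.9 threshold, sample size and file name are assumptions chosen for illustration, not settings drawn from any particular tool.

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    # Entropy in bits per byte: low for repetitive content,
    # approaching 8 for random-looking (encrypted or compressed) content.
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def looks_encrypted(path: str, threshold: float = 7.9) -> bool:
    # Read the first 64 KB; a high-entropy sample flags the file for
    # exception handling (threshold and sample size are illustrative).
    with open(path, "rb") as f:
        sample = f.read(64 * 1024)
    return shannon_entropy(sample) >= threshold

print(looks_encrypted("suspect_file.bin"))  # hypothetical path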


N-Gram and Shingle Generation
Increasingly, advanced analytics like predictive coding aid the review process and depend upon the ability to map document content in ways that support algorithmic analysis. N-gram generation and text shingling are text sampling techniques that support latent-semantic analytics (a brief sketch appears after this group of tasks).

Optical Character Recognition (OCR)
OCR is the primary means by which text stored as imagery, and thus lacking a searchable text layer (e.g., TIFF images, JPGs and PDFs), can be made text searchable. Some processing tools natively support optical character recognition; others require users to run OCR against exception files in a separate workflow, then re-ingest the content accompanied by its OCR text.

Virus Scanning
Files collected in e-discovery may be plagued by malware, so processing may include methods to quarantine afflicted content via virus scanning.
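To illustrate, the short Python sketch below generates word-level shingles (overlapping n-grams) of the kind downstream analytics might consume. The naive tokenization and three-token window are illustrative choices; production tools apply the normalization and tokenization rules discussed earlier in this primer.

import re

def shingles(text: str, n: int = 3):
    # Naive tokenization: lowercase and split on non-alphanumeric characters.
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    # Slide a window of n tokens across the text to build overlapping shingles.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The quick brown fox jumps over the lazy dog"
print(shingles(text, 3))
# ['the quick brown', 'quick brown fox', 'brown fox jumps', ...]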

Production Processing
Heretofore, we've concentrated on processing before search and review, but there's typically a processing workflow that follows search and review: Production Processing. Some tools and workflows convert items in the collection to imaged formats (TIFF or PDF) before review; in others, imaging is obviated by use of a viewer component of the review tool. If items aren't imaged before review, the e-discovery tool may need to convert items selected for production or redaction into imaged formats suited to those tasks.

Further, in conjunction with the production process, the tool will generate a load file to transmit extracted metadata and data describing the organization of the production, such as pointers to TIFF images and text extractions. Production processing will also entail assignment of Bates numbers to the items produced and embossing of Bates numbers and restrictive language (e.g., "Produced Subject to Protective Order").
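By way of example, the Python sketch below writes a rudimentary tab-delimited load file pairing generated Bates numbers with pointers to image and text files. The field names, paths, numbering scheme and delimiter are illustrative assumptions; actual load files (e.g., Concordance DAT or Opticon OPT) follow the delimiters and fields specified in the governing production protocol.

import csv

def bates(prefix: str, start: int, count: int, width: int = 8):
    # Generate sequential Bates numbers, e.g., ABC00000001.
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]

def write_load_file(path: str, docs, prefix="ABC", start=1):
    # docs is a list of dicts with 'tiff' and 'text' pointers (hypothetical).
    numbers = bates(prefix, start, len(docs))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["BEGBATES", "TIFFPATH", "TEXTPATH"])
        for num, doc in zip(numbers, docs):
            writer.writerow([num, doc["tiff"], doc["text"]])

write_load_file("production.dat",
                [{"tiff": r"IMAGES\001\ABC00000001.tif",
                  "text": r"TEXT\001\ABC00000001.txt"}])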

Illustration 1: E-Discovery Processing Model
Larger version next page.



Illustration 2: Anatomy of a Word DOCX File
By changing a Word document's extension from .DOCX to .ZIP, you can "trick" Windows into decompressing the file and sneak a peek into its internal structure and contents. Those contents little resemble the document you'd see in Word; but, as you peruse the various folders and explore their contents, you'll find the text, embedded images, formatting instructions and other components Word assembles to compose the document.

The root level of the decompressed file (below left) contains four folders and one XML file.

[Content_Types].xml contains the plain text XML content seen below. Note the MIME type declaration: application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
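For those who'd rather peek programmatically than rename the file, the Python sketch below opens a .docx directly as a ZIP archive, lists its parts and prints [Content_Types].xml. The file name is a hypothetical placeholder.

import zipfile

# A .docx is an ordinary ZIP archive, so it can be opened without renaming it.
with zipfile.ZipFile("example.docx") as docx:    # hypothetical file name
    for name in docx.namelist():
        print(name)                              # folders and parts, e.g., word/document.xml
    # The content-types part declares the MIME type of each part in the package.
    print(docx.read("[Content_Types].xml").decode("utf-8"))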
