High Performance Indexing of Large Heterogeneous Data Sets using GPU

Massimo Bernaschi IAC – National Research Council of Italy

Funded by the ISEC programme under GA n° 4000003856

GTC 2015

Why a new indexer?
• Law Enforcement Agencies need an easy and fast tool to index and search seized disk images

How it works
• Extract raw files and metadata from (seized) disk images
• Distribute them over multiple systems
• Extract plain text and metadata from every file
  – including deleted files
• Create distributed indexes
• Provide a friendly user interface to query results
• Organize query results in an intuitive visual representation

Architecture Overview
[Figure: architecture diagram. Components shown: Web GUI (Admin, Search), Searcher, Mediator, DBMS and Database, Coordinator (Job Scheduler, Status Manager), Worker Agent, Worker Nodes, Index Repository, HPC Cluster. Connections' legend: Web GUI, DB input/output, HPC cluster.]

Architecture Overview (cont.)
• Coordinator
  – Manages, coordinates and monitors the whole system
• DBMS
  – Provides the interface to the database
• Mediator
  – Mediates among all components to ease message communication
• Admin
  – Web UI
  – Used to manage the infrastructure, create investigation cases and add disk images for indexing
• Worker Agent
  – Runs on all worker nodes and provides services for monitoring, starting, stopping and configuring local components
• Index Repository
  – Stores the results of all indexing jobs

Architecture Overview (cont.)
• Each worker node can run one or more
  – Image-Extractor
    • to extract files from seized disk images
  – Docu-Parser
    • to transform extracted documents into plain text and metadata
  – Docu-Indexer
    • to create searchable indexes from transformed text and metadata
• These components are managed by worker agents
• They are connected to form an Extraction -> Parse -> Indexing pipeline
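The three components above can be pictured as stages of a streaming pipeline connected by in-memory queues. The following is a minimal sketch in Python, not the system's actual code: the function names, the dictionary used as a stand-in for a disk image, and the toy inverted index are all assumptions made for illustration. It uses one worker per stage, whereas the slides describe several parsers and indexers per extractor.

```python
import queue
import threading

SENTINEL = None  # marks the end of a stream between stages


def extract(image, out_q):
    """Stage 1 (hypothetical Image-Extractor): emit (path, raw bytes) pairs."""
    for path, raw in image.items():
        out_q.put((path, raw))
    out_q.put(SENTINEL)


def parse(in_q, out_q):
    """Stage 2 (hypothetical Docu-Parser): turn raw bytes into plain text."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        path, raw = item
        out_q.put((path, raw.decode("utf-8", errors="replace")))


def index(in_q, inverted):
    """Stage 3 (hypothetical Docu-Indexer): build a toy inverted index."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        path, text = item
        for term in text.lower().split():
            inverted.setdefault(term, set()).add(path)


def run_pipeline(image):
    """Run the three stages concurrently; only the index is kept at the end."""
    raw_q = queue.Queue(maxsize=8)   # bounded queues keep memory use flat
    text_q = queue.Queue(maxsize=8)
    inverted = {}
    stages = [
        threading.Thread(target=extract, args=(image, raw_q)),
        threading.Thread(target=parse, args=(raw_q, text_q)),
        threading.Thread(target=index, args=(text_q, inverted)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return inverted
```

Calling `run_pipeline({"a.txt": b"Seized disk", "b.txt": b"disk image"})` returns a dictionary mapping each term to the set of files that contain it. The bounded queues mean a fast extractor blocks instead of buffering the whole image in memory.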

Extract – Parse – Indexing Pipeline
[Figure: pipeline diagram with three stages (1: EXTRACT, 2: PARSE, 3: INDEXING), showing one Image-Extractor and four Docu-Parser / Docu-Indexer pairs.]

Disk Image Extraction
• Performed by the Image-Extractor component
• Based on The Sleuth Kit Library®
• Supports Unix, Linux, OS X and Windows volumes and file systems
• Extracts raw files and file system metadata

System metadata: CREATION_DATE, FILENAME, SIZE, PATH, LAST_MODIFICATION_DATE
The Sleuth Kit Library: http://www.sleuthkit.org/

Document Parsing
• Performed by the Docu-Parser component
• Based on the Apache Tika™ Library
• Detects and extracts document metadata and structured text
• Supports about 1400 file types

Document metadata: AUTHOR, TITLE, KEYWORDS, SUMMARY, LANGUAGE, TOOL, RIGHTS, FORMAT
Tika Library: http://tika.apache.org/

Document Indexing
• Performed by the Docu-Indexer component
• Based on Apache Lucene™ Libraries
• Provides indexing and search capabilities
• Index size is roughly 20-30% of the size of the text indexed
• Indexes are collected into the Index Repository

Apache Lucene™ Libraries: http://lucene.apache.org/

Document Searching
• Based on Apache Lucene™ Libraries
• Provides searching capabilities:
  – ranked searching
  – multiple-index searching with merged results
  – many powerful query types
  – fielded searching (e.g. title, author, contents)
• Working on presenting results through an efficient and interactive interface
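To make "fielded searching" and "multiple-index searching with merged results" concrete, here is a toy sketch in Python. It is not Lucene and does not use Lucene's scoring: each index is assumed to be a plain dictionary of documents, the score is simply the number of times the term occurs in the chosen field, and the names `search_index` and `search_all` are hypothetical.

```python
import heapq
from itertools import islice


def search_index(index, field, term):
    """Search one toy index; return (score, doc_id) hits, best first.

    The score is the term count in the given field, a stand-in for real ranking.
    """
    hits = []
    for doc_id, fields in index.items():
        score = fields.get(field, "").lower().split().count(term.lower())
        if score > 0:
            hits.append((score, doc_id))
    return sorted(hits, reverse=True)


def search_all(indexes, field, term, limit=10):
    """Search every index and merge the ranked hit lists into one ranking."""
    per_index = [search_index(ix, field, term) for ix in indexes]
    merged = heapq.merge(*per_index, reverse=True)  # inputs are already sorted
    return list(islice(merged, limit))


index_a = {"a1": {"title": "report", "contents": "disk disk image"}}
index_b = {"b1": {"contents": "disk"}, "b2": {"contents": "nothing here"}}
print(search_all([index_a, index_b], "contents", "disk"))
```

Because each per-index result list is already ranked, the merge only has to interleave them, which is why the distributed indexes can be queried independently and combined afterwards.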

HPC Document Indexing
• Text analysis requires tokenization, filtering and stop-word removal
• GPU cards offer huge computing power
• Combine CLucene indexing with GPU power to accelerate these steps

CLucene Libraries: http://clucene.sourceforge.net/

GPU CUDA Text Analysis
• Input text: "My name is Bob.\0"
• One CUDA thread per character. Each thread applies the LowerCase filter: "my name is bob.\0"
• Each CUDA thread performs tokenization by locating delimiter positions (-1 marks a non-delimiter character):
  -1 -1 2 -1 -1 -1 -1 7 -1 -1 10 -1 -1 -1 14 15
• Vector processing creates two vectors holding the start and end token indexes respectively:
  – Start indexes: 0 3 8 11 (relative to the input text)
  – End indexes: 2 7 10 14 (relative to the input text)
• One CUDA thread per token. Each thread applies the StopWords filter:
  – before: my, name, is, bob
  – after: my, name, bob

GPU CUDA Results (2070 Fermi)
[Figure: bar chart comparing CLucene with GPU+CLucene. x-axis: plain-text size (4MB, 32MB, 128MB); y-axis: time in seconds (0 to 70). Speed-up labels on the chart: 2x, 7x, 9x.]
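The GPU text analysis steps described above (lowercase filter, delimiter marking, start/end vectors, stop-word filter) can be emulated on the CPU to check the numbers on the slide. This Python sketch is only an illustration of the data flow, not the CUDA kernels: each list comprehension stands in for "one thread per element", and the delimiter set and stop-word list are assumptions, since the slides do not list them.

```python
DELIMITERS = set(" .,;:!?\t\n\0")                   # assumed delimiter set
STOP_WORDS = {"is", "a", "an", "the", "of", "and"}  # assumed stop-word list


def analyze(text):
    """Emulate the per-character and per-token steps of the GPU text analysis."""
    if not text.endswith("\0"):
        text += "\0"  # the slide's input is NUL-terminated; enforce that here
    n = len(text)

    # Step 1: one "thread" per character applies the lowercase filter.
    lowered = [c.lower() for c in text]

    # Step 2: one "thread" per character writes its own position if it is
    # a delimiter, and -1 otherwise.
    marks = [i if lowered[i] in DELIMITERS else -1 for i in range(n)]

    # Step 3: compact the marks into token start and end indexes.
    # A token starts at a non-delimiter that follows a delimiter (or position 0)
    # and ends at the first delimiter after a non-delimiter.
    starts = [i for i in range(n)
              if marks[i] == -1 and (i == 0 or marks[i - 1] != -1)]
    ends = [i for i in range(1, n)
            if marks[i] != -1 and marks[i - 1] == -1]

    # Step 4: one "thread" per token applies the stop-word filter.
    tokens = ["".join(lowered[s:e]) for s, e in zip(starts, ends)]
    kept = [t for t in tokens if t not in STOP_WORDS]
    return marks, starts, ends, kept


marks, starts, ends, kept = analyze("My name is Bob.\0")
print(marks)   # [-1, -1, 2, -1, -1, -1, -1, 7, -1, -1, 10, -1, -1, -1, 14, 15]
print(starts)  # [0, 3, 8, 11]
print(ends)    # [2, 7, 10, 14]
print(kept)    # ['my', 'name', 'bob']
```

On a GPU, step 3 would typically be done with a parallel compaction (for example a prefix sum followed by a scatter); the slides only say "vector processing", so that detail is an assumption.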

CUDA and (Java)Lucene 1/2
● How do they cooperate?

CUDA and (Java)Lucene 2/2
● How do they cooperate efficiently?
  o smart and efficient memory transfer using the Java Unsafe API

Test Environment
• 4 Worker Nodes
  – 4 CPUs / 24 cores, 2.67 GHz, 48 GB RAM
  – 2 × 2070 GPUs per node
  – Running Worker Agents and Extract – Parse – Indexing pipelines
• 1 Management Node
  – Running all other components
• Network: 1G Ethernet

Disk Images for Test
• Disk images built using the Govdocs1 document set
• The Govdocs1 digital corpus includes nearly 1 million freely redistributable files

[Figure: "Govdocs1 File Types" bar chart. Categories: gz, ps, ppt, doc, image, pdf; x-axis: share of files, 0% to 25%.]

Govdocs1 available at http://digitalcorpora.org/corpora/files

Results
[Figure: "Extract-Parse-Index Time" chart. x-axis: seized disk image size in GB (32, 80, 100, 210); y-axis: time, 0:00:00 to 1:12:00.]

Disk Image Size (GB)   Extract-Parse-Indexing Time   DD Time    # Files   Index Size (GB)
32                     00:09:15                      00:07:36   58225     4.8
80                     00:17:43                      00:14:50   117282    8.5
100                    00:37:16                      00:20:15   186305    12
210                    01:02:22                      00:33:21   368856    19

29/09/14

64 GB Disk Image Indexing
[Figure: bar chart of size in GB (axis 0 to 70) for Disk Image, Extracted Text and ISODAC Index, broken down by file type: text, pdf, xls, others, html, doc, csv, xml, ps, ppt, gz, image.]
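The figures in the results table can be turned into throughput and overhead numbers with a few lines of arithmetic. The Python sketch below uses only the values from the table; the derived metric names are my own, and it assumes that "DD Time" is the time needed to read the same image with dd, which the slides do not spell out.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


# (disk image size in GB, extract-parse-indexing time, DD time, index size in GB)
RESULTS = [
    (32, "00:09:15", "00:07:36", 4.8),
    (80, "00:17:43", "00:14:50", 8.5),
    (100, "00:37:16", "00:20:15", 12),
    (210, "01:02:22", "00:33:21", 19),
]


def summarize(rows):
    """Derive throughput, overhead relative to dd, and index/image size ratio."""
    summary = []
    for size_gb, pipeline_time, dd_time, index_gb in rows:
        t = to_seconds(pipeline_time)
        summary.append({
            "size_gb": size_gb,
            "gb_per_min": size_gb / (t / 60),
            "time_vs_dd": t / to_seconds(dd_time),
            "index_fraction": index_gb / size_gb,
        })
    return summary


for row in summarize(RESULTS):
    print(f"{row['size_gb']:>4} GB: {row['gb_per_min']:.2f} GB/min, "
          f"{row['time_vs_dd']:.2f}x dd time, "
          f"index = {row['index_fraction']:.0%} of image size")
```

Note that `index_fraction` here is relative to the disk image size, not to the size of the extracted text, so it is not directly comparable with the 20-30% figure quoted for the index-to-text ratio.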

Highlights
• Streamed In-Memory Extraction+Parse+Indexing
  – Only indexes are written to disk
  – Much faster than a Map-Reduce based solution
• File indexing failure recovery
  – Files are processed again in case of failure
  – Selectable file extraction and indexing
• Exportable indexes
  – Generated indexes can be exported and handed back to investigators

Future Works
• Distribute workload based on file type
• Enhance scheduling algorithm
• Support file extraction filtering
• Alternative ad-hoc parser based on file type
• CUDA version of Tesseract (for fast OCR)
• Enhanced and interactive results visualization

Tesseract OCR
• Profiling with valgrind's callgrind tool reveals that 3 functions account for approximately 50% of self time
  – go parallel: try to parallelize these in CUDA
• In multi-page documents, the ProcessPages function takes nearly 98% of total execution time
  – go parallel: OpenMP, 1 page per thread
  – or get the total number of pages and launch a process per page
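The page-level parallelism proposed above (one page per worker) can be sketched as follows. This is a Python illustration only: `ocr_page` is a hypothetical placeholder for the per-page work that ProcessPages performs, not a Tesseract call, and a thread pool is used here where the slide proposes OpenMP threads or one process per page.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page):
    """Hypothetical stand-in for recognizing the text of a single page."""
    return page.strip().upper()  # placeholder work, not real OCR


def ocr_document(pages, max_workers=4):
    """Process each page in its own task and return results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, pages))  # map preserves input order


print(ocr_document([" page one ", " page two ", " page three "]))
```

For CPU-bound work in Python, swapping in `ProcessPoolExecutor` would be the closer analogue of "launch a process per page"; the structure of the code stays the same because pages are independent of each other.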


Why Not Hadoop?
• Hadoop performance
• MapReduce performance [Jiang et al. (2010)] [Lin et al. (2012)]
• HDFS performance [Dong et al. (2014)]
• Seized disk images are neither stored on a cluster nor available on a distributed infrastructure
• As fast as possible
  – In-Memory Streaming Pipeline
• Only indexes are written to disk
• Ad-hoc recovery process
