DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS, STOCKHOLM, SWEDEN 2017

PDF document search within a very large database

LIZHONG WANG

KTH SKOLAN FÖR INFORMATIONS- OCH KOMMUNIKATIONSTEKNIK

Abstract

A digital search engine, which takes a search request from a user and returns a result that responds to that request, is indispensable for modern users who are accustomed to the Internet. At the same time, the PDF document format has been adopted by more and more people and is widely used today because of its convenience and effectiveness, and the traditional library has already started to be replaced by its digital counterpart. Combining these two factors, a document-based search engine that can query a digital document database with an input file is urgently needed. This thesis is a software development project that aims to design and implement a prototype of such a search engine and to propose potential optimization methods for Loredge. The research can be divided into two main parts: prototype development and optimization analysis. It involves an analytical study of sample documents provided by Loredge and a multi-perspective performance analysis. The prototype contains reading, preprocessing and similarity measurement. The reading part reads a PDF file using the imported library Apache PDFBox. The preprocessing part processes the read document and generates a document fingerprint. The similarity measurement is the final stage, which measures the similarity between the input fingerprint and all document fingerprints in the database. The optimization analysis balances resource consumption in terms of response time, accuracy rate and memory consumption. According to the performance analysis, the shorter the document fingerprint is, the better the search program performs. Moreover, a permanent feature database and a similarity-based filtration mechanism are proposed to further optimize the program. This project lays a foundation for further study of document-based search engines by providing a feasible prototype and relevant experimental data. The study concludes that future work should mainly focus on improving the effectiveness of database access, which involves data entry labeling and search algorithm optimization.

Keywords: Portable Document Format, Search, Document Identification, Cosine Similarity, Document Preprocessing, Document Search, Optimization Method, Performance Analysis, Classification, Regression, Loredge.


Abstrakt

Digital sökmotor, som tar en sökfråga från användaren och sedan returnerar ett resultat som svarar på den begäran tillbaka till användaren, är oumbärligt för moderna människor som brukar surfa på Internet. Å andra sidan, det digitala dokumentets format PDF accepteras av fler och fler människor, och det används i stor utsträckning i denna tidsålder på grund av bekvämlighet och effektivitet. Det följer att det traditionella biblioteket redan har börjat bytas ut av det digitala biblioteket. När dessa två faktorer kombineras, framgår det att det brådskande behövs en dokumentbaserad sökmotor, som har förmåga att fråga en digital databas om en viss fil. Den här uppsatsen är en mjukvaruutveckling som syftar till att designa och implementera en prototyp av en sådan sökmotor, och föreslå relevant optimeringsmetod för Loredge. Den här undersökningen kan huvudsakligen delas in i två kategorier, prototyputveckling och optimeringsanalys. Arbeten involverar en analytisk forskning om exempeldokument som kommer från Loredge och en prestandaanalys utifrån flera perspektiv. Prototypen innehåller läsning, förbehandling och likhetsmätning. Läsningsdelen läser in en PDF-fil med hjälp av en importerad Java bibliotek, Apache PDFBox. Förbehandlingsdelen bearbetar det inlästa dokumentet och genererar ett dokumentfingeravtryck. Likhetsmätningen är det sista steget, som mäter likheten mellan det inlästa fingeravtrycket och fingeravtryck av alla dokument i Loredge databas. Målet med optimeringsanalysen är att balansera resursförbrukningen, som involverar responstid, noggrannhet och minnesförbrukning. Ju kortare ett dokuments fingeravtryck är, desto bättre prestanda visar sökprogram enligt resultat av prestandaanalysen. Dessutom föreslås en permanent databas med fingeravtryck, och en likhetsbaserad filtreringsmekanism för att ytterligare optimera sökprogrammet. Det här projektet har lagt en solid grund för vidare studier om dokumentbaserad sökmotorn, genom att tillhandahålla en genomförbar prototyp och tillräckligt relevanta experimentella data. Den här studie visar att kommande forskning bör huvudsakligen inriktas på att förbättra effektivitet i databasåtkomsten, vilken innefattar data märkning och optimering av sökalgoritm.

Nyckelord: Portable Document Format, Sökning, Dokument Identifiering, Cosine Similarity, Dokument Förbehandling, Dokument Sökning, Optimering metod, Prestandaanalys, Klassificering, Regression, Loredge


Contents

Chapter 1 Introduction
1.1 Background
1.2 Problem Statement
1.3 Purpose
1.4 Approach
1.5 Limitations
1.6 Delimitations
1.7 Thesis Outline

Chapter 2 Background
2.1 Portable Document Format
2.2 Article
2.3 Big Data and Database
2.4 Machine Learning
2.5 Apache PDFBox Java Library

Chapter 3 Methodology
3.1 Research Strategy
3.2 Understanding the Client Requirement
3.3 Data Collection
3.4 Prototype Development
3.5 Evaluation Method

Chapter 4 Requirement and Data Analysis
4.1 Client Requirements
4.2 Data Analysis

Chapter 5 Prototype Analysis
5.1 Prototype Development Result
5.2 Document Fingerprint Analysis
5.3 Prototype Environment Analysis
5.4 Accuracy Analysis
5.5 Response Time Analysis
5.6 Memory Consumption Analysis
5.7 Similarity Analysis

Chapter 6 Discussion
6.1 Methodology and Consequence
6.2 Development Environment Discussion
6.3 Problem Statement Reiteration
6.4 Summative Evaluation

References


Chapter 1 Introduction

With the exponential growth in the development and usage of digital libraries, more and more documents are not only published but also stored on the Internet instead of in traditional media. A critical and complex task is to build a digital database search engine that can use a document file as a search statement to query a very large database and return the file that has the highest similarity to the input. This thesis presents the task of developing such a document based search engine, analyzing execution results under different arguments and putting forward proposals aimed at optimizing search performance. The performance requirement on the search engine is short response time and low workload combined with high accuracy and scalability. Since the Portable Document Format (PDF), released as an open standard by the International Organization for Standardization, is one of the most widely used digital document formats, all input files in this thesis are in PDF format. The remaining sections of this chapter briefly introduce the motivation of the research, describe and define the objective and focus area of the work, and then declare both the delimitations and limitations of the research.

1.1 Background

1.1.1 Loredge Background

Loredge is a startup company founded in October 2015. It aims to develop a digital platform that helps people manage knowledge more effectively. One of the four founders, Anna Abelin, describes the original purpose behind Loredge: “the quantity of information is increasing at a mind-blowing speed, it becomes more and more important to be able to decide what information is useful.” [1] The whole Loredge application consists of three components: a platform application installed on the client side, a server that handles all data transfer and provides database services to the various clients, and a database that statically stores all related user information and document files. The Loredge database is very large, and it is designed to accept and store only document files in the Portable Document Format. The PDF files can be uploaded or shared by either users or administrators. One of the key functions supported by Loredge is PDF Annotation Sharing. Loredge holds the original PDF file in its database and generates an editable copy when users want to work on the original file, for example to add comments and highlights. Moreover, the copy can be shared among a group of users, while it is only visible and editable for the users who have permission to access it. In order to turn Loredge into a global open knowledge platform, where people can both manage their own knowledge and visit other people’s knowledge, the company is going to seek cooperation with more individuals and organizations.

1.1.2 Problem Background

Each time a user opens a PDF file in the Loredge platform, the server checks whether the file is already in its database by measuring the document's similarity to all existing files in the database. If the highest measured similarity is below a certain value, the platform decides that the file is uploaded for the first time and automatically stores it in the database; otherwise the platform returns the file with the highest similarity found in the database to the user, and the user decides either to use the existing version or to upload and share a new file. This pre-matching process is designed to reduce data redundancy in the database by avoiding, as far as possible, saving identical or over-similar files, that is, documents with identical or over-similar contents in either the same or a different format. The key algorithm in the pre-matching process is also applied in another important process, database creation. It extracts certain data from an input PDF file and uses these data to create a unique identifier representing the file. The generated identifier (also called a fingerprint) works as a primary key or index in the search process. It aims to reduce the search time in the pre-matching process, because comparing two keys is obviously easier and quicker than comparing two original files.
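To make the pre-matching decision concrete, the following Java sketch compares an input fingerprint against all stored fingerprints and either returns the best match or signals that the file should be stored as new. The Fingerprint interface, the similarityTo method and the threshold value are illustrative assumptions, not Loredge's actual implementation.

    import java.util.List;

    // Illustrative pre-matching sketch: find the most similar stored fingerprint
    // and decide whether the uploaded file is new or already in the database.
    public class PreMatcher {

        static final double THRESHOLD = 0.8; // assumed cut-off, not Loredge's real value

        // A fingerprint is assumed to expose a pairwise similarity in [0, 1].
        interface Fingerprint {
            double similarityTo(Fingerprint other);
            String documentId();
        }

        // Returns the id of the best match, or null if the file should be stored as new.
        static String preMatch(Fingerprint input, List<Fingerprint> database) {
            Fingerprint best = null;
            double bestScore = 0.0;
            for (Fingerprint candidate : database) {
                double score = input.similarityTo(candidate);
                if (score > bestScore) {
                    bestScore = score;
                    best = candidate;
                }
            }
            // Below the threshold the document is treated as uploaded for the first time.
            return (best != null && bestScore >= THRESHOLD) ? best.documentId() : null;
        }
    }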


1.2 Problem Statement

This project aims to solve the problem by answering the following underlying questions:

How can the file with the highest document similarity to an input PDF document be found in a very large document database?

In what way can the developed file search method be improved?

The first objective focuses on developing a program that measures the document similarity between two PDF documents. It mainly contains two phases: document reading and document comparison. The second objective is an optimization that improves the developed program from several perspectives. For such a search engine, accuracy and response time are considered the two most important factors that decide the performance.

1.3 Purpose

The purpose of the research is to propose and develop a framework for the document-based search engine used in the Loredge platform. Relevant and feasible optimization methods should also be studied in order to provide sufficient experimental data for Loredge. The study touches on several scientific areas: the first stage requires capability in document analysis and program development, and the optimization stage needs techniques from analytic geometry and machine learning. The experience gained from this project lays a foundation for further software development.

1.4 Approach

The whole project is carried out in a traditional waterfall model. First, the problem statement is clearly understood. Then relevant research data are collected and analyzed in order to gain a better understanding of the research subject. The prototype development is then performed in an incremental model, following a design-implementation-test cycle. Afterwards, the developed prototype is analyzed and evaluated from several angles. Lastly, the whole project is evaluated by applying both formative and summative evaluation. Both quantitative/qualitative research and inductive/deductive approaches are applied to achieve different purposes.


1.5 Limitations

The project has several aspects that cannot be controlled. They are restrictions on the research and affect the methodology, design and conclusions to a greater or lesser extent.

• Sampling data. The sample provided by Loredge consists of two hundred documents (100 input files + 100 database files). All analysis and optimization in the research are based entirely on the results for these sample documents. An overfitting problem may occur, such that the final solution does not generalize well, which means that the program may perform poorly on new input data.

• Environment. The developed program is executed only on a personal laptop computer, so all execution results that are analyzed and discussed are based on one system environment. Other system environments, both hardware and software, may give different results and solutions.

• Personal knowledge. The research and its solution are also limited by the author's knowledge and experience in program development.

1.6 Delimitations

The delimitations set boundaries for the dissertation. They aim to make the research goals and objectives more achievable, so that the research does not become unnecessarily and impossibly large to complete.

• Document format. The implemented program uses the Portable Document Format as the only input document format; in other words, formats other than PDF do not work in the current version of the search engine. The reason is that PDF is the most widely and commonly used document format. Furthermore, the project focuses mainly on the text content of the PDF files, since it is very difficult to develop an effective but simple method to compare multimedia and images.

• Database. This study creates a simple database where each data entry is a plain txt file. There are many capable database systems that can be downloaded and used for free, but it is difficult to learn and understand a new piece of software in a short time.

• Programming language. Java is the main development language in this project. The program imports an open source Java library in order to work with PDF documents. There are several candidate programming languages, but most of them can express the same solution, for example an implementation in Java can be ported to C, C++ or Python. The research simply chooses the language best suited to the development.


1.7 Thesis Outline

This thesis consists of the following main chapters.

Chapter 2. Background. This chapter introduces the background of the problem and the thesis. The reader obtains the background information and related knowledge needed to understand the thesis.

Chapter 3. Methodology. This chapter summarizes all research methodologies applied in the thesis. All research strategies in the project workflow are introduced in detail. Moreover, the evaluation methods are described together with several relevant goals.

Chapter 4. Requirement and Data Analysis. The client requirements and the sample data are introduced and analyzed in this chapter. In terms of the client requirements, both functional and non-functional requirements are specified according to the understanding of the requirements. The sample data and the collected data are presented separately, and a deeper analysis of the data is conducted.

Chapter 5. Prototype Analysis. This chapter presents the execution results and analyzes the developed prototype from several different perspectives. The analysis not only determines the performance of the current version but also provides the optimization methods.

Chapter 6. Discussion and Evaluation. The last chapter discusses and evaluates the whole project by answering the problem stated in Chapter 1 and applying the evaluation methods described in Chapter 3.



Chapter 2 Background

This chapter introduces the background of the thesis. Basic theory of the PDF document format and of databases is given in order to understand the project. Several machine learning approaches are also described briefly.

2.1 Portable Document Format

The Portable Document Format (PDF) was invented by Adobe Systems Incorporated. It is a digital file format used for presenting and exchanging documents independently of application software, hardware and operating system. [2]

2.1.1 Content of a PDF file

Each PDF file encapsulates a flat document that can include the different elements needed to display it, such as text and images. The first version of PDF supported only plain text formatting and inline images, but more and more content types have been supported in recent years: not only text and images, but also fonts, diagrams, tables and hyperlinks, and even multimedia can be found in a PDF file. [4]

2.1.2 Advantages in using PDF files

The greatest merit of PDF is that it is easy for everyone to create and read. Once a PDF file is created, it strictly keeps its fixed content and layout; the same PDF file always displays exactly the same things no matter which software, device or operating system it is opened on. PDF uses little space on devices, because it can compress a large amount of information into a small file size without degrading the quality of the images, which makes document organization and arrangement a lot easier. There are different security options in the PDF settings that can be customized by the user to protect PDF files against different security issues. Because of these merits, PDF is more and more widely used, especially after the great rise of the World Wide Web and the HyperText Markup Language (HTML) document. In 2008, PDF was officially released as an open standard published by the International Organization for Standardization [3].


2.1.3 Disadvantages and Difficulties in PDF file handling

There are two different types of PDF files [5]. The first is the native PDF, which is made from an electronic source. The second is the scanned PDF, which is made by scanning a physical paper document with a scanner. A native PDF relies on electronic character designations, so users can directly select, search and copy text in such a text-based PDF. A scanned PDF, however, treats the whole scanned document as an image; if users want to perform such operations, they must first use Optical Character Recognition [6] to convert the images into machine-encoded text. Because the layout of a PDF is freely defined by the user, there is no good rule that lets a machine handle the different kinds of content (text, diagrams, tables and so on) separately; for example, it cannot identify whether a graphic is an image or a diagram, and for a machine there is no significant difference between body text and a legend.

2.2 Article

An article can be defined in the following way: “A written composition in prose, usually nonfiction, on a specific topic, forming an independent part of a book or other publication, as a newspaper or magazine.” [7]

2.2.1 Publication

When the Internet was not widely used, most articles were published through traditional media, such as newspapers, TV and radio. Along with the development of technology, the Internet has emerged as a new medium and has played an important role in the media market in recent years. Because of the merits of the Internet, more and more articles are published directly on the new media instead of the legacy media; at the very least, there will be a copy of the article on the Internet.

2.2.2 Display and Storage on the Internet

Digital articles on the Internet are usually displayed and stored in two ways. Most articles are published directly on web pages, for example on news websites and blogs. The other way to read an article is to click a hyperlink that points to a file. This method is widely used when the uploaded article file has already been laid out by the author or publisher in a fixed format, for example an academic or scientific paper. Furthermore, the user can also use the hyperlink to download the file to a local disk. The uploaded files come in different document formats, but the most used format is PDF, the standard for document exchange.


2.3 Big Data and Database

2.3.1 Big Data Introduction

Big Data refers, as the term can be interpreted literally, to huge volumes of data. The history of Big Data as a special term in information technology is brief; there is no official document that determines when the term was first mentioned and used, but an industry-recognized answer is the 1990s. Entering the Information and Internet Age, data is increasingly created and presented in digital format. Data-storage capacity keeps increasing at a geometric rate, which makes data sets larger and more complex. Thus, it is extremely hard to process, manage and sort such data sets, manually or automatically, within a reasonable time.

2.3.2 Database and Fingerprint

A database is a collection of digital data; each data entry is an independent individual, and the entries can be organized by certain rules. Each entry in the database has a unique identification, which consists of keys that represent and identify the data entry. The fingerprint is generated manually, automatically or both when the data entry is stored or altered in the database. The greatest benefit of having a fingerprint is convenience and efficiency when querying a very large database; a high-quality fingerprint ensures the efficiency and availability of the search. There are some requirements that need to be satisfied when keys are generated from the data entry. In principle, the fingerprint must be unique so that the corresponding search result is exact and unambiguous. A non-redundant key that contains only necessary information can also save more data storage and system memory than a redundant one. A further improvement is to apply the basic concept of the Support Vector Machine [8] [9]. By using short and varied keys instead of long and complex keys to generate the fingerprint, each data entry can be treated as an independent point. These points are virtually transformed into a high-dimensional space where each key works as an axis of the coordinate system. This projection makes identification and separation much easier than usual. Comparing the positions of two data points can be done in a parallel way, so the execution time of the search program is reduced. In addition, using keys that best represent and define the original data is a good choice for database modification and maintenance.
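As an illustration of treating every key as one axis of a coordinate system, a document fingerprint can be stored as a sparse map from key to weight, so that each document becomes a point in a high-dimensional space. The Java class below is only a sketch under that assumption; it is not taken from the Loredge code base.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: a sparse fingerprint, i.e. a point in a high-dimensional space
    // where every distinct key (e.g. a word) is one axis of the coordinate system.
    public class SparseFingerprint {

        private final Map<String, Double> weights = new HashMap<>();

        // Increase the coordinate on the axis that corresponds to the given key.
        public void add(String key, double weight) {
            weights.merge(key, weight, Double::sum);
        }

        public double get(String key) {
            return weights.getOrDefault(key, 0.0);
        }

        public Map<String, Double> asMap() {
            return weights;
        }
    }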


2.3.3 Database Search Engine

A database search engine is a kind of search engine that operates on a digital database. The search engine applies certain computer algorithms to search the database and identify items that match the information given by the user. With ever-growing databases, people have higher expectations on search quality, and programmers face several challenges in search engine design. The three factors that determine the performance of a search engine are generally response time, accuracy rate and operational cost. In reality, it is almost impossible for a search engine to have the shortest response time, the highest accuracy rate and the lowest cost simultaneously; the goal is therefore to find a point that fulfills these conditions as far as possible at the same time. The factor with the greatest impact on the performance of the search program is speed, and program execution speed can be optimized in different ways. The simplest and most direct improvement is to apply a faster search algorithm. Many other optimization methods can be implemented. Reducing and removing redundant and unnecessary information from the original data keeps the program from checking useless data when it goes through the database. Pre-sorting and classifying the entries in the database can also simplify processing in the stages where data transformations and aggregations happen. Preprocessing and filtering the input data from the user is another effective method, for example treating words in singular and plural form as the same and ignoring common stop words in the search statement.
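The following Java sketch illustrates such input preprocessing with a hard-coded stop-word list and a deliberately naive singular/plural rule; a production system would use a proper stemmer and a fuller stop-word list, so the names and rules here are assumptions for illustration only.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of search-statement preprocessing: lower-casing, stop-word removal
    // and a very naive singular/plural normalization.
    public class QueryPreprocessor {

        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "of", "to", "in"));

        public static List<String> preprocess(String query) {
            List<String> terms = new ArrayList<>();
            for (String token : query.toLowerCase().split("[^a-z]+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) {
                    continue; // drop empty tokens and common stop words
                }
                // naive plural handling: "documents" -> "document"
                if (token.endsWith("s") && token.length() > 3) {
                    token = token.substring(0, token.length() - 1);
                }
                terms.add(token);
            }
            return terms;
        }
    }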

2.4 Machine Learning

Machine Learning is a scientific discipline focused on the development of algorithms that spot patterns or make predictions from empirical data [12]. Machine learning research often overlaps with computational and mathematical statistics, which means that a task in machine learning is often to solve a mathematical optimization problem that selects the best or most suitable element from a set of available alternatives.

2.4.1 Supervised Machine Learning

Supervised machine learning infers a function from a dataset where all data are labeled, so that each training example is a pair of an input and a corresponding output. By analyzing the relationship between input and output data, supervised learning infers a function that maps from the input variable to the output variable.

Output = f(Input)

The learning performance depends directly on this mapping function, so the most important factor in the learning is to find the function that best represents the relationship between the input data and the output data.

2.4.1.1 Classification Analysis

Classification analysis is the task of identifying the category to which the input data belongs, which means that the output variable of the learning model is categorical and discrete. An algorithm that implements classification is often a mathematical function that maps input data to a category. Figure 2.1 is a simple example of a classification that separates red and blue points.

Figure 2.1 A simple classification example

The linear classifier is the most basic classification algorithm; it applies a linear combination of feature values to make a classification decision [13].

o = f(w · x) = f( Σ_i w_i x_i ), where o is the output, w is the weight vector and x is the input vector.


This learning seeks the best weights w_i, and there are two suitable algorithms for achieving that goal. Perceptron learning is an incremental learning rule whose weights only change when the output is wrong; it is guaranteed to converge when the problem can be solved, that is, when the data are linearly separable. The delta rule is another incremental learning rule, but its weights always change regardless of the output, and it will always arrive at an optimal solution to the problem. In order to avoid a separating hyperplane with poor generalization, the distances (also called margins) from the data points to the hyperplane can be maximized, because the hyperplane with the widest margin often generalizes best. Simplicity and efficiency are the major advantages of such a classifier; both training and validation are easy to implement. Because the linear classification model is quite simple, overfitting is not a problem that needs to be seriously considered. There are, however, several limitations to using linear classification. First of all, the classifier relies heavily on the distribution and structure of the data, and the classification model will be too simple and general to be useful when the number of samples is small. Linear classification is also poorly suited to handling linearly inseparable data. Finally, a linear classifier essentially works in the given low-dimensional feature space and provides no high-dimensional pattern recognition.
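The perceptron rule described above can be sketched in a few lines of Java; the learning rate value and the bias handling below are illustrative choices, not part of the thesis prototype.

    // Sketch of the perceptron learning rule: weights are only updated
    // when the predicted class is wrong.
    public class Perceptron {

        private final double[] w;      // weight vector, one weight per feature
        private double bias = 0.0;
        private final double lr = 0.1; // learning rate (illustrative value)

        public Perceptron(int features) {
            this.w = new double[features];
        }

        // Linear combination followed by a threshold: returns +1 or -1.
        public int predict(double[] x) {
            double sum = bias;
            for (int i = 0; i < w.length; i++) {
                sum += w[i] * x[i];
            }
            return sum >= 0 ? 1 : -1;
        }

        // One training step; target is +1 or -1.
        public void train(double[] x, int target) {
            int output = predict(x);
            if (output != target) {               // perceptron rule: change only on error
                for (int i = 0; i < w.length; i++) {
                    w[i] += lr * (target - output) * x[i];
                }
                bias += lr * (target - output);
            }
        }
    }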

2.4.1.2 Regression Analysis

Regression is a statistical task in data mining. It studies whether two or more variables are related, in terms of direction and intensity, and then establishes a mathematical model to predict the target associated with an arbitrary new input. The dependent variable of the learning model must be real-valued and continuous, or at least close to continuous, while there is no specific requirement on the independent variables. A regression model mainly contains three kinds of variables: unknown parameters, independent variables and a dependent variable, as shown below, where Y is the dependent variable, β is the unknown parameter and X is the independent variable.

Y ≈ f(β, X)

Different types of the function f are used in different regression analyses. One always uses the function that best describes the relationship between the dependent and independent variables, and in order to make the analysis easier and more effective, the function should be as flexible and convenient as possible.


Linear regression is the most basic and most commonly used regression; it uses the least squares approach to model the relationship between one dependent variable and one or more independent variables, where the dependent variable is a linear combination of the parameters. Linear regression assumes that the sampled data points are independent of each other. Figure 2.2 below shows a simple linear regression that fits the blue points with a red trend line.

Figure 2.2. A simple linear regression example

Given random sample data (Y_i, X_i1, X_i2, ..., X_ip), i ∈ [1, n], there is an error term ε_i, a random variable that captures any other effect on Y_i, so a multiple linear regression model can be written as:

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + β_3 X_i3 + ... + β_p X_ip + ε_i,  i ∈ [1, n]

The linear regression model does not need to be a linear function of the independent variables; the linearity here means that the conditional expectation of Y_i is linear in the parameters β_0, β_1, β_2, ..., β_p. The conditional expectation of Y_i is a random variable equal to the average of Y_i over each possible condition. A more general linear model can be written in matrix form as:

Y = Xβ + ε


Here Y is a column vector that contains the observed values; β is a column vector that contains the set of unknown parameters that need to be estimated; X is an observation matrix of the independent variables; and ε is a column vector of unobserved random variables. Simple linear regression refers to linear regression with one independent variable and one dependent variable; the relationship is a non-vertical straight line, and in this case X, Y, β and ε in the model above are all scalars rather than matrices or vectors. Linear regression is widely used in different areas [14]. One of the most representative applications is linear trend estimation [15]. A trend line can usually be drawn by visually inspecting a set of data points, but a more correct solution is to use linear regression to model the points with a linear function and calculate its slope, as sketched below. The limitation of linear regression is that it shows optimal results when the relationship between the dependent and independent variables is linear or almost linear, so it is inappropriate to use when the data have nonlinear relationships.
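As an illustration of linear trend estimation, the following Java sketch computes the ordinary least squares slope and intercept for the one-variable case; the class and method names are illustrative only.

    // Sketch of simple linear regression (one independent variable): the least
    // squares estimates of intercept and slope for a trend line y = a + b*x.
    public class SimpleLinearRegression {

        // Returns { intercept, slope }; assumes x and y have equal length and x is not constant.
        public static double[] fit(double[] x, double[] y) {
            int n = x.length;
            double meanX = 0.0, meanY = 0.0;
            for (int i = 0; i < n; i++) {
                meanX += x[i] / n;
                meanY += y[i] / n;
            }
            double sxy = 0.0, sxx = 0.0;
            for (int i = 0; i < n; i++) {
                sxy += (x[i] - meanX) * (y[i] - meanY);
                sxx += (x[i] - meanX) * (x[i] - meanX);
            }
            double slope = sxy / sxx;
            double intercept = meanY - slope * meanX;
            return new double[] { intercept, slope };
        }
    }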

2.5 Apache PDFBox Java Library

Apache PDFBox [16] is an open source Java library for working with PDF documents. It implements a variety of features for dealing with PDF documents, such as extracting text, splitting and merging files, creating new PDFs and printing PDFs. The newest version of Apache PDFBox is 2.0.3, released on September 17th, 2016. The PDFBox project was started by Ben Litchfield in 2002 as a way of extracting PDF content so that it could be indexed by the Lucene search engine. [17] It entered Apache Incubation on February 7th, 2008, [18] and the first release of Apache PDFBox took place on September 23rd, 2009. The Apache PDFBox project graduated and became an Apache top level project in October 2009. The class PDFTextStripper [19] is used for extracting text content from a PDF file: it takes a PDF document and strips out all text content sequentially from all pages, ignoring the formatting. There is another class, PDFTextStripperByArea, that extends PDFTextStripper. This class is also used for extracting text content from a PDF document, but the difference is that it extracts text from a specified region.
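A minimal example of the text extraction described above, assuming Apache PDFBox 2.x (PDDocument and PDFTextStripper as documented by the project); the surrounding class is illustrative.

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Minimal text extraction with Apache PDFBox 2.x: PDFTextStripper strips
    // the plain text of all pages sequentially, ignoring the formatting.
    public class PdfTextExtractor {

        public static String extractText(File pdfFile) throws IOException {
            try (PDDocument document = PDDocument.load(pdfFile)) {
                PDFTextStripper stripper = new PDFTextStripper();
                return stripper.getText(document);
            }
        }
    }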


Chapter 3 Methodology

This chapter introduces and explains the main strategies and methods applied in the research. A flowchart represents and describes the workflow of the whole project step by step, and each field in the flowchart is explained in detail in the subsections.

3.1 Research Strategy

In order to achieve the objectives stated in section 1.2, a suitable and effective research strategy must be set up before the project starts. The research strategy is a general guide that helps to gain useful and reliable information and data relating to the research issue, and then to use these materials to analyze and solve the research problem in a systematic and methodical way.

3.1.1 Main Methodology

This section briefly introduces some common methods used in academic and scientific research. These methods are applied to varying degrees in different phases of this thesis.

3.1.1.1 Quantitative Research

Quantitative research is a methodological paradigm that relies primarily on quantifiable data. To achieve its objective, the development of a mathematical model, quantitative research focuses on generalizing data and inferring the quantitative relationships within them. The sample in quantitative research is often a large number of instances representing the research object. The data collection stage often applies structured techniques, so the gathered data can be reliably aggregated and comparisons between different sample groups are feasible. More importantly, the collected data must be measurable, which means that they should be in numerical and statistical form [20]. The most classical and widely applied method in the data analysis phase is statistical analysis, which describes and explains both the relationships within and the structure of the sample data. In other words, quantitative research objectively infers a mathematical and statistical model from the sample data, and the inferred model is applicable to further research on the entire population of interest.


3.1.1.2 Qualitative Research

Qualitative research is the opposite methodological approach; it relies on the collection of qualitative data. It contains different research methods and is therefore widely applied in different areas. Its objective is to gain a better understanding of phenomena and the underlying reasons or motivations for decision-making [21] [22], and the aim can vary with different research backgrounds. Qualitative research focuses on smaller but more concentrated samples that produce knowledge about specific research cases. Both structured and unstructured methods can be applied for collecting and categorizing data into different groups based on the research themes. Unlike the statistical requirements in quantitative research, any type of data related to the research is acceptable in a qualitative study. It follows that the data analysis phase is usually non-statistical; the approach seeks the meaningful content of the sample data and then gives a reliable explanation.

3.1.1.3 Inductive Approach

The inductive approach, also called inductive reasoning, is a research method for proposing theories and ideas. This approach always begins by collecting data relevant to the study topic, after which an observation stage looks for patterns or regularities in the large amount of sample data. After analyzing the patterns and regularities, some temporary hypotheses can be roughly constructed. As a result, a more general and certain theory based on the hypotheses is proposed. The procedure of the inductive approach is illustrated in Figure 3.1.

Figure 3.1 The process of the inductive approach: Data Collection → Data Observation → Pattern/Regularity → Hypothesis Construction → Theory Proposition

In short, the inductive approach is generally a subjective approach [23] that always starts with something specific and then moves towards something more general, which is why it is also known as the “bottom-up” approach.


3.1.1.4 Deductive Approach

The deductive approach works in the opposite direction to the inductive one. It is an objective approach that focuses on testing proposed hypotheses based on existing theories, as Figure 3.2 shows. Since the deductive approach moves from a general level to a specific one, it has also received the informal name “top-down”.

Figure 3.2 The process of the deductive approach: Existing Theory Study → Hypothesis Proposition → Data Collection → Observation & Testing → Confirmation & Rejection

This approach starts with studying and understanding existing theories. After that, a research topic based on the theory study is brought up. After turning the question into a provable hypothesis, the data collection phase is carried out in order to prepare specific and relevant data for observation and testing. The observation and testing stage applies certain methods to test the proposed hypothesis. Finally, by examining the outcome of the testing stage, the hypothesis is either confirmed or rejected. If it is rejected, the original theory may need to be modified or updated.

3.1.2 Methodology applied in the thesis

This research mostly follows the inductive approach. It starts with understanding the client requirements and collecting relevant experimental data, and aims to find the best or optimal solution to the research subject. More importantly, no hypothesis is proposed before the project begins. Both quantitative and qualitative methods are applied in the research to achieve their respective objectives: the qualitative method is mainly used in the data collection and evaluation phases, and the quantitative method is conducted in the data analysis and performance analysis phases.


3.1.3 Process overview

This project is in essence a software development project, and it applies the most common software development model, the waterfall model. The waterfall is a sequential life cycle model that includes a number of clearly defined and distinct work stages. Using the model is intended to create a high-performance product that satisfies all client requirements as far as possible. The overall structure of the waterfall is very simple to understand, and each phase is easy to use and manage. Moreover, it works more effectively than other models when the project is relatively small and the client requirements are clearly understood. The process of the whole research is shown in Figure 3.3 below; a detailed description of each step is given in the following sections.

Figure 3.3 The process of the research: Problem Statement → Requirement Elicitation and Requirement Analysis (Proactive Evaluation) → Data Collection: Sample Documents, Development Environment, Similarity Measure (Clarificative Evaluation) → Prototype Development: Design, Implementation, Testing (Interactive Evaluation) → Summative Evaluation


3.2 Understanding the Client Requirement

This section describes how the client requirements are understood and analyzed.

3.2.1 Requirements Elicitation

The problem statement must be understood before the project starts. Most clients usually have an abstract and broad idea of what they want as a result, but do not know the details of that idea. A meeting with Loredge is arranged in the initial stage, where the client introduces the background and objective of the Loredge software and then presents the client requirements. Afterwards, a detailed discussion of the requirements is conducted with the client in order to understand the requirements completely.

3.2.2 Requirements Analysis

The requirements analysis is intended to turn the abstract idea into a specific and executable program development. Both functional and non-functional requirements from the client are specified and analyzed. The functional requirements define what a system is supposed to do, and the non-functional ones define how the system is supposed to be; in other words, both the objective and the method of the development are defined. As a result, the development model is decided and the development schedule is planned.

3.3 Data Collection

The data collection is the preparation stage of the research, where all useful and necessary knowledge and material relating to the research subject is gathered. The preparation consists of three independent parts, as Figure 3.4 shows. A sequential order is applied in order to make each part easy to plan and manage.

Figure 3.4 The process of the Data Collection: Sample Document Collection → Development Environment → Statistical Algorithm Selection


3.3.1 Sample Document Collection

Since the research object is the PDF document, the collection process starts with gathering PDF documents for the research. The sample requirement is that it must be possible to measure similarities among the sample documents, which means that document comparison is feasible and the degree of similarity can be presented in numerical form. After the sample documents are collected, an analysis of document structure and content is conducted. The analysis gives a basic understanding of what a PDF document looks like and what content it usually contains. This understanding is intended to be one of the major factors in the program development, and the result of the document analysis helps to prepare other relevant experimental data.

3.3.2 Development Environment

After the sample document collection, it is time to choose a programming language. Since all programming languages have their own advantages and disadvantages, comparison and evaluation are necessary; the language most suitable for the research is the first choice. Furthermore, an extensible resource that is able to work with PDF documents is needed. This resource is a computing library written in a certain programming language, and it is mainly used for fetching content from a PDF file. Moreover, a digital document library is needed: a database intended to store all sample documents and relevant data.

3.3.3 Statistical Algorithm Selection

A set of numerical and statistical methods must be set up to perform the similarity measurement among PDF documents. The method is applied not only in the document comparison, but also in database creation and updates; the measurement is undoubtedly the core unit of the search program. To achieve the best performance, the selection process is based on the result of the document analysis. The chosen measure should satisfy both the functional and non-functional requirements as far as possible. The decision is made after a comprehensive consideration of the factors that affect performance, for example the convenience, complexity, efficiency and extensibility of the metric.


3.4 Prototype Development

In software development, a prototype is a rudimentary working model of a product or information system, usually built for demonstration purposes or as part of the development [24]. The prototype in this research refers to a rough implementation of a Loredge search engine program. It is used to concretely demonstrate the operational principle and working process of the program.

3.4.1 Design of Prototype

The identified and analyzed functional and non-functional requirements are designed and determined. The functional requirements refer to the functionalities and objectives of the prototype and are defined as a set of inputs, logical behavior and outputs. In contrast, the non-functional requirements are detailed in the system architecture design and are defined as a set of performance characteristics of the system. In this research, the whole development is divided into a set of small unit developments. A unit test is arranged after each unit development is done; afterwards, all units are integrated as a group, and finally an integration test is planned. The process of the development is illustrated in Figure 3.5.

Figure 3.5 The process of the Prototype Development: Unit Development → Unit Testing → Integration → Integration Test

Each unit here represents a module of the whole program and has a clearly defined functionality. A unit works as an independent and complete small program, which takes certain input variables as parameters and generates the required output as its result. The program division is based on the functional requirements; in other words, the whole program consists of several flexible functionalities executed sequentially. In the meantime, a set of non-functional requirements is designed and stated. The non-functional requirements are more varied and multi-objective than the functional ones. The design is a combination of the requirement analysis and an existing resource, a given checklist of non-functional requirements. [2]


3.4.2 Implementation of Prototype

The implementation in this research refers to writing program code in the selected programming language in an integrated development environment. The statistical algorithm was decided in the data collection stage, and the structure and functionalities of the program have also been designed, so the implementation phase is rather simple, although the design is adjusted where necessary according to the actual situation.

3.4.3 Testing of Prototype

Program testing is an important phase in software development. In this research, two kinds of testing methods are applied: unit testing and integration testing. Unit testing is intended to isolate each unit of the program for analysis and debugging; it is carried out after each individual unit is implemented. The unit test feeds in different arguments and checks whether the generated results are as expected; if they are, the tested unit has achieved the designed functionality. By applying unit testing, program defects are detected early in the development cycle, so design improvement and code refactoring remain possible. The integration test is executed after all individual units have been tested and integrated; it aims to verify the functional and reliability requirements of the integrated units. As with unit testing, the integration test takes different types of arguments to check whether the combination of units works correctly as designed; both the generated result and the quality of the execution are verified and analyzed.
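As an illustration of such a unit test, the JUnit 4 sketch below feeds different arguments into the illustrative QueryPreprocessor class from section 2.3.3 and compares the generated results with the expected ones. JUnit is not prescribed by the thesis; it is assumed here purely for demonstration.

    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;
    import java.util.Collections;

    import org.junit.Test;

    // Illustrative unit test for the QueryPreprocessor sketch: different inputs
    // are processed and the generated term lists are compared with expectations.
    public class QueryPreprocessorTest {

        @Test
        public void stopWordsAreRemovedAndPluralsNormalized() {
            assertEquals(Arrays.asList("search", "document"),
                    QueryPreprocessor.preprocess("a search of the documents"));
        }

        @Test
        public void emptyQueryGivesEmptyTermList() {
            assertEquals(Collections.emptyList(), QueryPreprocessor.preprocess(""));
        }
    }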

3.5 Evaluation Method

The implemented prototype is analyzed and evaluated from different perspectives in order to determine whether the implementation achieves both the non-functional and the functional requirements as designed and expected. The evaluation result is documented to provide a good foundation for further study and development. Both formative and summative evaluation are applied in order to obtain high-quality results.

3.5.1 Formative Evaluation

Formative evaluation aims to strengthen or improve the object being evaluated [25], and it is generally performed before and during the project's implementation. [26] This category of evaluation involves qualitative methods and focuses strongly on the details of the content and performance of the object.


The whole research is divided into several short phases, and the prototype is developed iteratively using a systems development life cycle. After each phase and life cycle, a short meeting with Loredge is arranged for feedback, for evaluating what the phase achieved and for planning the next step. This set of phased evaluations during the development ensures that the client requirements are correctly understood and defined and, importantly, that the research is on the right path.

3.5.1.1 Proactive Evaluation

Proactive evaluation is a method that provides information for decisions about how to best develop a program in advance of the planning stage. [27] The evaluation is performed after the Data Collection stage but before the Prototype Development, and the corresponding approaches are needs assessment and research review. The key questions that must be clearly answered are:

• What do people know about the research?
• What are the needs of the research?
• What kind of program does the research need?

The client requirements are evaluated to determine the scope of the project, and the evaluation of the collected data aims to synthesize the related literature for each identified client requirement. The result of the evaluation is intended to provide enough information and understanding for the prototype development.

3.5.1.2 Clarificative Evaluation

Clarificative evaluation focuses on clarifying the intent and operation of a program. [28] The reason for applying clarificative evaluation is that the main direction and objectives of the research are clear, but the detailed process is rather vague. The evaluation is performed during the Prototype Development. The two related approaches carried out in the research are evaluability assessment and logic mapping, and the two corresponding key questions set for the evaluation are:

• What are the intended results?
• How can the intended results be achieved by the program?

A graphical description of the logical relationship between the inputs, behavior and outputs of the program is made in the design phase of the first development life cycle, and it is updated and modified after each iterative cycle. The short meetings after each phase and cycle work as the evaluability assessment, which is intended not only to identify the program's objectives and its intended activities, but also to control the gap between them.


3.5.1.3 Interactive Evaluation

Interactive evaluation is synonymous with improvement: this kind of evaluation aims to improve the program's design and implementation. The evaluation is used during the Prototype Development with one important topic:

• How could the program be changed to be more effective?

The evaluation is carried out in two steps: determining whether the latest version of the program is effective, and determining whether it needs improvement. The first step starts with program performance testing using different input arguments, followed by comparing and analyzing the corresponding output results. Since both the number and the type of sample data are limited, trend analysis is required before applying the program to a large population. Afterwards, parameter adjustment is performed in order to optimize the program performance.

3.5.2 Summative Evaluation

In contrast, the summative evaluation examines the effects or outcomes of some object, [26] and it takes place at the end of an operating cycle of the project. The evaluation aims to determine whether the project has achieved its objectives or outcomes. Outcome evaluation, performed at the end of the project, is a method to summarize the performed project; it focuses on summarizing and examining the outcomes of the project rather than its content. Several questions are clearly answered in the evaluation:

• Have the goals/objectives of the project been met?
• What is the overall impact of the program?
• What resources will be needed in order to address the program's weaknesses?
• How sustainable is the project, and does it need to be continued in its entirety?

Both the process and the result of the evaluation are presented to the client after the project. The evaluation is associated with objective and quantitative methods of data collection. In other words, the evaluation not only gives a summary of the project, but also provides a foundation for further development.


Chapter 4 Requirement and Data Analysis

This chapter analyzes the data collected in the preparation stage. The result of the analysis provides a basic understanding for the prototype development.

4.1 Client Requirements

The client requirements are presented and analyzed in this section.

4.1.1 Requirement Presentation

After the meeting with Loredge, the client requirement is clearly presented and understood. The requirement is to develop a prototype of a document based search engine, whose behavior is described as follows. The program reads in a PDF document and then uses the input file as a search statement to query a PDF-file based database. A document comparison process is performed in order to calculate the similarity between the input file and each data entry in the database. Finally, the program returns the counterpart PDF file that has the highest similarity with the input file as the result.

4.1.2 Requirement Analysis

The client requirements are analyzed after the requirements elicitation. Both the functional and the non-functional requirements are specified in the analysis, and the result of the analysis gives direction to the data collection and the prototype development.

4.1.2.1 Functional Requirements

The functional requirements refer to the functionalities of the search engine and mainly consist of the following components.

• Reading in PDF documents. This is the premise and basis on which the program can exist and develop. The system must be able to read the content of a PDF document file word by word, even character by character, into its memory.


• Pre-processing documents. To be able to measure document similarity, the original documents must be converted into a certain statistical data form. Furthermore, in order to make the comparison process more effective, the original data should be pre-processed so that redundant and irrelevant data are filtered out.

• Generating a unique document identification. The identification is a data entry stored in the database; it represents the original document.

• Measuring similarities between PDF documents. A similarity measure is required in order to quantify the similarity between two documents. As a result, the similarity is a statistical, numeric value presented as a percentage: generally speaking, if two documents are exactly the same, the similarity is 100%; if they are completely different, the similarity is 0%. A sketch of one such measure is given after this list.
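One common way to obtain such a percentage, and the measure suggested by the thesis keywords, is the cosine similarity between two term-frequency fingerprints. The Java sketch below assumes the fingerprints are word-count maps; multiplying the result by 100 gives the percentage. It is an illustration, not the prototype's actual code.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch: cosine similarity between two word-count fingerprints.
    // Returns 1.0 for identical term distributions and 0.0 for disjoint ones.
    public class CosineSimilarity {

        public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
            Set<String> terms = new HashSet<>(a.keySet());
            terms.addAll(b.keySet());
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (String term : terms) {
                double x = a.getOrDefault(term, 0);
                double y = b.getOrDefault(term, 0);
                dot += x * y;
                normA += x * x;
                normB += y * y;
            }
            if (normA == 0.0 || normB == 0.0) {
                return 0.0; // at least one empty fingerprint
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }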

4.1.2.2 Non-functional Requirements

The non-functional requirements describe how the search engine works. Several basic and specific criteria for the prototype are defined:

• Platform independence/platform compatibility. Because the project is a kind of software development that should be usable in various system environments, the developed product must have strong compatibility; it should execute correctly on various platforms and operating systems, e.g. Windows, iOS and Android.

• Response time. In search engine development, the response time is the total time spent responding to a search request. For a search engine, the response time is one of the most important factors that determine whether the search process is effective or not, and most performance optimization focuses on reducing this time.

• Accuracy. The other non-functional requirement for a search engine, even more important than the response time for users, is accuracy. The accuracy of the query result is required to be as high as possible within an acceptable range.

• Extensibility. Extensibility is the capacity to extend the development environment with further functionalities. As the project is a prototype development, it requires good extensibility in order to accept and implement new extensions to the user environments.


4.2 Data Analysis

All the relevant data and material are collected and analyzed in this section.

4.2.1 Selection of Sample Documents

In this section, the relevant sample documents are collected, and a document analysis is performed in order to understand these documents and prepare the document pre-processing.

4.2.1.1 Sample Document Collection

Loredge provides a set of sample PDF documents for the research, described as follows. The sample consists of 200 articles evenly divided between two folders: one folder is the author copy and the other is the published copy. The author copy refers to the original version of the document owned by the authors, and the published copy is the same article but with a few modifications compared with the original version. Therefore, each document in the author-copy folder always has exactly one highly similar counterpart in the published-copy folder.

4.2.1.2 Sample Document Analysis

After studying and analyzing the collected sample documents, several characteristics of the documents are obtained and summarized. This understanding aims to improve the program efficiency by enabling a data pre-processing step.

4.2.1.2.1 Document Structure Analysis All pages of all the documents in both folders have a header and a footer that are separated from the main body, as shown in Figure 4.1. In most cases, the header and footer contain meta-data that identify the document or the page, such as page number, author name, publisher name, article title and issue number. Hence, the footer and header are considered redundant information and are filtered out.

Figure 4.1 Document Structure (a document page consists of a header, a main body and a footer)


4.2.1.2.2 Document Content Analysis The sample PDF documents, in both the author-copy and published-copy folders, contain only plain text and static images in the main body, as shown in Figure 4.2. The images can be divided into two categories: text-based images and pure images. The plain text within the documents consists of three basic elements: the alphabet, punctuation marks and Arabic numerals. Fortunately, there are no multimedia plugins in the sample files.

Figure 4.2 Basic elements of Document Content (plain text: alphabet, punctuation, Arabic numerals; static images: pure images and text-based images)

According to the result of the functional requirements analysis in section 4.1.2.1, in order to make the search engine effective, the redundant and irrelevant data in the sample documents should be pre-processed and filtered out.

There are several reasons why the images within the sample documents are not considered useful data. Firstly, not all the documents contain images, so images are not necessary for a document. Secondly, an image usually works only as an additional explanation of the document, which means that it is mainly a graphical representation of certain plain text. Furthermore, it is extremely difficult to design and implement a small but effective image-comparison program.

After filtering out all the images, the document analysis and pre-processing focus on the plain text of the documents: the punctuation, the Arabic numerals and the alphabet. The usage and importance of each are analyzed. Punctuation is used to structure and organize a written work; it separates sentences and their elements and clarifies meaning. Thus, punctuation carries no specific meaning of its own; it works as a language tool that helps the reader read and understand printed text correctly, and it is not a key element in the document analysis. Consequently, the punctuation is also filtered out.


The Arabic numerals are the ten digits from 0 to 9, and all numbers are combinations of these digits. Numbers usually express the degree of some quantity; they make text data numerical. The Arabic numerals are not necessary for a document, since they can be represented by words, e.g. 5 is equal to five in English. Generally, the numerals account for a low proportion of a document, as shown in Figure 4.3, and they have little substantial effect on the content of the document. For this reason, the document analysis does not focus on the Arabic numerals, and the numbers are also filtered out.


Figure 4.3 Percentage of Arabic Numerals in 200 sample documents

It could not be more obvious that the most important and useful part of a document is the words, the smallest units of language that carry particular meanings. The word is the most crucial factor in the understanding and analysis of a document. The following analysis therefore focuses on the words; their attributes and frequencies are examined in order to perform a further pre-processing of the sample documents. All the sample PDF documents are scientific and academic articles written in English, so the words here mainly refer to English words, consisting of the English alphabet. In certain documents there are a few non-English words that contain non-English letters, usually Greek letters in mathematical and physical formulas, e.g. α, β and µ. These words mainly denote specific variables and objects. Since their frequencies are even lower than those of the punctuation and the numbers, the non-English words are also removed in the document pre-processing.


It seems that no matter what kind of document it is, there are some common words, e.g. the, as, an, there, and. These words usually belong to the word classes pronoun, preposition and determiner. They are the most common words and therefore have a relatively high frequency in the language; in computing they are called Stop Words. After making a rough stop-word statistic over the sample documents, it is evident from Figure 4.4 that the proportions of the stop words are large and noticeable. The distribution of the points in the diagram shows that the percentages remain stable around 40%, almost half of the document. Consequently, the stop words play an important role in any text-based document.


Figure 4.4 Percentage of stop words in sample documents

On the other hand, the stop words in a document have no important significance, and they do not provide any additional information for a search engine. For this reason, they are usually considered irrelevant for any search purpose; e.g. there is no significant difference between apple and the apple. At the same time, the stop words have a greater amount and higher frequency than other words, as illustrated above. These two disadvantages mean that querying and processing the stop words in a search program consume a large amount of both time and space. In other words, the search performance becomes weaker if the stop words are kept. In conclusion, filtering out the stop words is a necessary and important process that aims to improve the search performance.


After removing the punctuation, the Arabic numerals, the non-English letters and the stop words from the sample documents, the remainder consists of words that better identify and represent the original document. The two charts in Figure 4.5 show the quantity and the percentage distributions of the Remaining Words in the 200 sample documents.


Figure 4.5 The quantity (left) and the percentage (right) distributions of the remaining words in the 200 sample documents

The quantity points are widely distributed in the left half of Figure 4.5, but it is quite obvious that most of them are over one thousand and that the highest values reach ten thousand. In contrast with the quantity, the percentage points in the right half of Figure 4.5 remain stable at around 40%-50% over the 200 sample documents. Of course, there is no guarantee that the quantity of the remaining words stays under ten thousand in any document, since the PDF format has no quantitative restrictions on either words or pages. But it is an undoubted fact that a complete English scientific or academic article usually has more than 2000 words and contains about 1000 remaining words, which corresponds to circa 4-5 pages in A4, the most common page size.


4.2.1.2.3 Document Fingerprint Proposition Clearly, a document comparison over several thousands of words has a large time and memory consumption. To improve the search process and save both time and space, a further reduction of the quantity of the words to be compared is required. In this project, the final product is called a feature, which is an identification of the original PDF document. As a result, a relatively fast feature comparison is performed instead of a slow document comparison. In order to make the comparison process permanently effective, an extra feature database on the server side is required. The feature database stores all the features that are generated from the documents in the original PDF document database. The whole search process is thus carried out as follows. The program reads in an input PDF document and generates an input feature that represents the document; the input feature is then used as the query statement to iterate through the feature database to find a feature that satisfies the search requirement. Finally, the feature database can directly point to the corresponding PDF document in the original PDF document database, as illustrated in Figure 4.6.

Figure 4.6 Applying a feature database to speed up the search (the client-side search program converts the input PDF document into an input feature, which is matched against the server-side feature database; the feature database points to the original PDF document database in the Loredge database)


Since the feature must be a unique identification of the original PDF document according to the functional requirements, it is also called a document fingerprint. Two factors must be considered during document fingerprint generation.

• The Size of the Document Fingerprint. The size refers to the quantity of elements used to construct the document fingerprint. A large and detailed document fingerprint always gives a more accurate representation of the original document, but its resource consumption in both time and memory is higher. Conversely, a small and general fingerprint has a lower resource consumption, but it gives a weaker representation and lacks an effective way to avoid duplication. The most suitable size of the document fingerprint therefore achieves a balance between accuracy and resource consumption.

• The Content of the Document Fingerprint. In general, the words that are not very common in the language but have a high frequency in the original document should be used to create the document fingerprint, as they are more representative. In order to describe the importance of a word and identify the document, the occurrence of each chosen word is also required. A document fingerprint is therefore a set of key-value pairs, where the key is a high-frequency word and the value is the occurrence of the word; a minimal sketch of such a key-value representation is given below. The problem to be further analyzed is how to set an occurrence requirement, which should vary with the document.
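As an illustration of this key-value representation (a sketch only, not the Loredge implementation), the word counts of an already pre-processed document can be collected into a Java Map, where each key is a word and each value is its occurrence:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: a document fingerprint represented as key-value pairs,
// where the key is a word and the value is its occurrence count.
public class FingerprintSketch {

    // Counts how often each word occurs in the (already pre-processed) word list.
    static Map<String, Integer> countOccurrences(List<String> remainingWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : remainingWords) {
            counts.merge(word, 1, Integer::sum); // increment the count for this word
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input, matching the example fingerprints used later in the text.
        List<String> words = List.of("apple", "banana", "apple", "orange",
                "banana", "apple", "banana", "banana", "orange");
        System.out.println(countOccurrences(words)); // e.g. {banana=4, orange=2, apple=3}
    }
}
```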

4.2.1.2.4 Document Analysis Conclusion The document pre-processing aims to improve the search performance. The punctuation, the numbers, the non-English words and the stop words are filtered out from the sample PDF documents. A document fingerprint that represents the original document, consisting of high-frequency words and their occurrences, is then extracted. The two factors to take into consideration in the selection are the size and the content.


4.2.2 Selection of Programing Environment This section presents the development environment, the programing language, the relevant resources and the local database. The selection of the development environment is important for meeting the requirements specified in section 4.1.2.2.

4.2.2.1 Programing Language Java is the programing language used in the program development, since it has several advantages over other programing languages. It is an object-oriented language, and the greatest merit of object-oriented programing is code reuse and recycling. Importantly, Java is platform-independent, which means that a Java-based program can be executed on practically every hardware and software platform. This characteristic satisfies one of the non-functional requirements, platform-independence or platform-compatibility. Moreover, Java-based programs are extensible; a large number of Java libraries have been developed by professional companies and published as open source on the Internet. These ready-made libraries make the programing convenient and effective. Thus, another specified non-functional requirement, extensibility, is also met. Additionally, Java is robust and reliable; latent problems and errors can be detected and reported already during the compilation phase, which makes debugging and testing easier and more effective.

4.2.2.2 Relevant Resource Apache PDFBox is imported as a ready-made Java library in order to be able to work with PDF documents. The library is used for reading and extracting content out of the PDF documents, and it implements the first functional requirement specified in section 4.1.2.1. The reason for applying an existing library is to avoid wasting time on developing a new application, since PDFBox is sufficiently mature and stable.
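For illustration, a minimal reading sketch with PDFBox might look as follows; the file name is a placeholder, and the exact loading call depends on the PDFBox version (this sketch assumes the 2.x API):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Minimal sketch of reading the text content of a PDF file with Apache PDFBox 2.x.
public class PdfReaderSketch {

    static String readText(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document); // plain text of the whole document
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is a placeholder path, not a file from the Loredge sample set.
        System.out.println(readText(new File("sample.pdf")));
    }
}
```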

4.2.2.3 Local Database The database refers to the feature database, which is a local document folder that contains text files. Each text file is a feature that represents a PDF document. The project does not focus on choosing a sophisticated database, and a simply structured database usually generalizes well. Using such a database mainly aims to save the time of learning and testing more advanced but complex database software. Moreover, the response time analysis becomes easier in a simpler development environment.


4.2.3 Selection of Similarity Measurement This section presents the similarity measurement chosen in this project. Several of its advantages are also described in order to provide a good understanding.

4.2.3.1 Cosine Similarity The chosen measure is Cosine Similarity. The similarity is measured as the cosine of the angle between two vectors in N-dimensional space, where N can be any positive integer. Since the measure is used here in positive space, the outcome ranges from 0 to 1: it is 1 when the two vectors have exactly the same orientation, and 0 when the vectors are mutually perpendicular.

Figure 4.7 Cosine Similarity Example

An example of applying the cosine similarity is illustrated in Figure 4.7. The similarity between vector a and vector b is calculated as the cosine of the angle θ between the vectors, as shown in the figure. According to the Euclidean dot product, the formula is:

$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \cos\theta$$

where $\vec{a} \cdot \vec{b}$ is the dot product, $\|\vec{a}\|$ is the length (norm) of vector a, and $\|\vec{b}\|$ is the length of vector b. Thus, the function can be rewritten as:

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

Here $a_i$ and $b_i$ are the components (dimensions) of vector a and vector b, and all the components in the formula are positive.
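A minimal Java sketch of this formula, for two vectors that already share the same dimensions, might look as follows (the numeric example in main is illustrative only):

```java
// Minimal sketch of the cosine similarity formula above for two vectors
// that already share the same dimensions (components are non-negative).
public class CosineSketch {

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // sum of a_i * b_i
            normA += a[i] * a[i]; // sum of a_i^2
            normB += b[i] * b[i]; // sum of b_i^2
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // design choice: similarity with a zero vector is defined as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // e.g. {3, 4, 2} vs {1, 3, 5} gives approximately 0.7847
        System.out.println(cosineSimilarity(new double[]{3, 4, 2}, new double[]{1, 3, 5}));
    }
}
```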


4.2.3.2 Application of cosine similarity within the document comparison Returning to the conclusion of the document analysis in section 4.2.1.2, the parameters used for calculating the similarity are the document fingerprints. A document fingerprint is a set of key-value pairs, where the key is a high-frequency word and the value is the occurrence of the word. In the cosine similarity measure, these fingerprints are treated as vectors in a high-dimensional positive space, where each word in the fingerprint refers to a unique dimension and the corresponding occurrence is the value of that dimension in the coordinate system.

For example, document A is characterized by a vector A = {apple/3, banana/4, orange/2}, and document B is represented by a vector B = {apple/1, banana/3, orange/5}. The three common dimensions in this example are apple, banana and orange, which means that both vectors lie in exactly the same three-dimensional space. According to the cosine similarity measure, the similarity is calculated as:

$$\cos\theta = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \, \|\vec{B}\|} = \frac{3 \cdot 1 + 4 \cdot 3 + 2 \cdot 5}{\sqrt{3^2 + 4^2 + 2^2} \cdot \sqrt{1^2 + 3^2 + 5^2}} = 0.7847$$

Thus, the similarity between the two vectors is 0.7847; in other words, the similarity between the two documents is 78.47%, expressed as a percentage.

4.2.3.3 Advantages of using the Cosine Similarity The main characteristics of the cosine similarity are described in this section.

Dimension Requirement As explained in the introduction above, the cosine similarity measure is able to calculate the similarity between two vectors in a space of any dimension. Moreover, it does not even require that the vectors contain exactly the same dimensions, because the measure is based entirely on the direction of the vectors. The only requirement is that both vectors lie in positive dimensional spaces. This characteristic suits the fingerprint comparison. According to the definition in the previous section, the document fingerprint is generated from the remaining words. Different documents generally have different remaining words, in terms of quantity (Figure 4.5) and variety; hence, fingerprints extracted from different documents generally consist of different quantities and varieties of elements.


For example, document C is characterized by a fingerprint C = {apple/3, banana/4}, and document D is represented by a fingerprint D = {apple/1, orange/5, melon/2}. Since there are four unique dimensions in this case (apple, banana, orange and melon), both vectors must be placed in the same space, so that they have the same dimensions in the same order:

• vector C becomes {apple/3, banana/4, orange/0, melon/0},
• vector D becomes {apple/1, banana/0, orange/5, melon/2}.

Applying the Euclidean dot product, the similarity is calculated like this:

$$\cos\theta = \frac{3 \cdot 1 + 4 \cdot 0 + 0 \cdot 5 + 0 \cdot 2}{\sqrt{3^2 + 4^2 + 0^2 + 0^2} \cdot \sqrt{1^2 + 0^2 + 5^2 + 2^2}} = 0.1095 = 10.95\%$$

In summary, the cosine similarity is an effective measure even when two fingerprints consist of different quantities and varieties of elements. The solution is that the original vectors are converted into new vectors in the same dimensional space, which consists of all the dimensions occurring in either of the original vectors.
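A minimal sketch of this idea in Java, operating directly on two fingerprint Maps and treating missing words as zero components, might look as follows (the example fingerprints mirror the C/D example above):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of comparing two fingerprints with different words:
// the vectors are implicitly extended with zero components over the union of all words.
public class FingerprintSimilaritySketch {

    static double cosineSimilarity(Map<String, Integer> c, Map<String, Integer> d) {
        Set<String> allWords = new HashSet<>(c.keySet());
        allWords.addAll(d.keySet()); // union of the dimensions of both fingerprints

        double dot = 0.0, normC = 0.0, normD = 0.0;
        for (String word : allWords) {
            double ci = c.getOrDefault(word, 0); // missing word = component 0
            double di = d.getOrDefault(word, 0);
            dot += ci * di;
            normC += ci * ci;
            normD += di * di;
        }
        if (normC == 0.0 || normD == 0.0) {
            return 0.0; // design choice: similarity with an empty fingerprint is 0
        }
        return dot / (Math.sqrt(normC) * Math.sqrt(normD));
    }

    public static void main(String[] args) {
        Map<String, Integer> c = Map.of("apple", 3, "banana", 4);
        Map<String, Integer> d = Map.of("apple", 1, "orange", 5, "melon", 2);
        System.out.println(cosineSimilarity(c, d)); // approximately 0.1095
    }
}
```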

Outcome Requirement The outcome requirement is specified in the functional requirements in section 4.1.2.1: the degree of similarity must be a numeric value presented as a percentage, ranging from 0% (totally different) to 100% (identical). The outcome of the cosine function has two characteristics that make the measure a good choice for the similarity calculation in this project.

• The range satisfies the outcome requirement: the outcome of the cosine similarity of two vectors in any positive dimensional space ranges from 0 to 1, and the outcome is a continuous variable.

• The cosine function is monotone in any positive dimensional space; the two extremes of the function are 1, which gives the highest similarity, and 0, which gives the lowest. The similarity decreases progressively as the angle between the two vectors increases. The monotone interval ensures that the similarity between two vectors is unique, which makes it easier to find the highest similarity among the sample data.


Chapter 5 Prototype Analysis After collecting and analyzing the relevant data, a prototype of the search engine is developed based on the understanding gained from the data analysis. The whole development is presented, analyzed and evaluated in this chapter. Due to the confidentiality agreements, several key parts of the development are not presented.

5.1 Prototype Development Result The result of the prototype development is presented in this section.

5.1.1 Prototype Design The prototype development consists of several consecutive but independent unit developments, as illustrated in Figure 5.1 below. Each unit development contains a set of functionalities and aims to implement a specified functional requirement defined in section 4.1.2.1.

PDF Reading (library importing, document reading) → Document Processing (document preprocessing, fingerprint extraction) → Similarity Calculation (vector creation, cosine similarity)

Figure 5.1 Unit developments

5.1.1.1 PDF Document Reading This unit development enables the search program to work with PDF documents. Apache PDFBox is imported by the program as a Java library. The input PDF document is converted into a Java object, and then only the text content of the document is read into the program's memory word by word.


5.1.1.2 Document Processing This unit development converts the original documents into a comparable data format. It consists of the document pre-processing and the document fingerprint extraction. The document pre-processing is intended to provide high-quality data for extracting the document fingerprint by filtering out redundant and irrelevant data. As concluded from the document analysis, the punctuation, the Arabic numerals, the non-English letters and the stop words are removed from the document. The extracted fingerprint consists of a set of the highest-frequency words with their occurrences in the document. It works not only as the vector used in the similarity calculation, but also as the identification of the data entry stored in the database.
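As a rough illustration of this pre-processing step (a sketch only; the actual stop-word list and filtering rules of the project are not reproduced here), the filtering could be expressed in Java as follows:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Minimal sketch of the pre-processing: punctuation, Arabic numerals and
// non-English letters are stripped, the text is lower-cased, and stop words
// are removed. The stop-word list is a tiny illustrative subset.
public class PreprocessSketch {

    private static final Set<String> STOP_WORDS =
            Set.of("the", "as", "an", "a", "there", "and", "of", "to", "in");

    static List<String> remainingWords(String rawText) {
        return Arrays.stream(rawText.toLowerCase()
                        .replaceAll("[^a-z]+", " ") // keep only English letters
                        .trim()
                        .split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word)) // drop stop words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(remainingWords("The 5 apples, as shown in Figure 2, are red."));
        // -> [apples, shown, figure, are, red]  (with this illustrative stop-word list)
    }
}
```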

5.1.1.3 Similarity Calculation This is the querying process of the document search. A fingerprint extracted from the input document is used as a search statement to query the fingerprint-based database. Linear search is applied in order to iterate through all data entries in the database. First, the document fingerprints are converted into vectors in the same positive N-dimensional space according to their sizes, and then the cosine similarity is used to measure the similarity between the vectors. Finally, the data entry with the highest and most suitable similarity to the input document is returned.
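A minimal sketch of this querying step is shown below; the in-memory map stands in for the text-file based feature database, and the similarity function (e.g. the cosine similarity sketched in section 4.2.3.3) is passed in as a parameter:

```java
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Minimal sketch of the linear search over all stored fingerprints,
// returning the name of the data entry with the highest similarity to the input.
public class LinearSearchSketch {

    static String bestMatch(Map<String, Integer> inputFingerprint,
                            Map<String, Map<String, Integer>> featureDatabase,
                            ToDoubleBiFunction<Map<String, Integer>, Map<String, Integer>> similarity) {
        String best = null;
        double bestSimilarity = -1.0;
        for (Map.Entry<String, Map<String, Integer>> entry : featureDatabase.entrySet()) {
            double s = similarity.applyAsDouble(inputFingerprint, entry.getValue());
            if (s > bestSimilarity) { // keep the most similar data entry seen so far
                bestSimilarity = s;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```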

5.1.2 Prototype Implementation The prototype is implemented by writing Java code.


5.1.3 Prototype Testing/Debugging The testing/debugging phase of the program development aims to determine whether the designed and implemented prototype works correctly and returns the expected values. Several different types of input parameters are used in the tests, which is intended to give the program better generalization.

5.1.3.1 Unit Testing The unit testing is performed at the end of each unit development. According to the design, there are three corresponding unit tests in the prototype development.

PDF Document Reading This unit test checks whether the text content of the input PDF document is correctly extracted. The content of the PDF document is read into a plain text file, and a text-content comparison between the input PDF document and the output text file is then carried out by visual inspection.

Document Processing This test aims to determine whether all redundant and irrelevant data is correctly removed from the input document. In the same manner, the pre-processed document is written to a text file word by word. Afterwards, a testing program is executed; it reads the text file and checks whether the file contains the defined redundant and irrelevant data.

Similarity Calculation Since it is extremely difficult to check against arithmetical errors in computing, this unit test focuses on determining whether the unit can measure the similarity between two vectors of different dimensions. Because the outcome is a real value ranging from 0 to 1, the second focus of the unit test is to determine whether the measured result lies within this interval.

5.1.3.2 Integration testing Sometimes the developed units work correctly separately, but this does not guarantee that they work correctly together. The integration testing focuses on checking the parameters passed among the different modules/units. As a result, the data structures of the passed parameters are unified as needed, e.g. all integers are represented as Long and all floating-point numbers as Float. The representations of fingerprints and vectors are integrated into the same data structure.

5.2 Document Fingerprint Analysis In this section, the designed and implemented document fingerprint extraction method is presented and analyzed, since the fingerprint is the key element of the program. First, the terms proposed in section 4.2.1.2 are reviewed, and the representation of the fingerprint in Java is introduced. Then, the document fingerprint extraction methods are described. Finally, several different versions of the method are given in order to compare the performance of the prototype from different perspectives.

5.2.1 Term Definition The term Remaining Words refers to the words of an English document that remain after removing redundant and irrelevant data. Studying the remaining words aims to provide high-quality data for producing the Document Fingerprint. The Document Fingerprint refers to a unique document identification, consisting of high-frequency words and their occurrences. The fingerprint is derived from the Remaining Words and aims to improve the document comparison process. It is used not only for processing the input document, but also for creating the feature database.

5.2.2 Representation As previously stated, the higher occurrence a word has in an article, the more representative the word generally is. Thus, the words used for creating the fingerprint are supposed to have higher occurrences than other words. To be able to obtain the high-frequency words effectively, the set of the remaining words is converted into the Java object Map [29], where the Key is the word and the Value is the occurrence of the word. The Map is then sorted into descending order of Value. Put simply, the most frequent word is placed at the top of the sorted Map, and the least frequent one at the bottom. Afterwards, a certain number of entries is extracted top-down for creating the document fingerprint. The fingerprint is therefore also represented by a Map, where the Key is a high-frequency word and the Value is the occurrence of the word.
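For illustration, sorting such a Map into descending order of Value can be sketched in Java as follows (this is only one possible way to do it, not necessarily the one used in the prototype):

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of sorting the word-count Map into descending order of Value,
// so that the most frequent words can be read top-down during fingerprint extraction.
public class SortByValueSketch {

    static Map<String, Integer> sortByValueDescending(Map<String, Integer> counts) {
        Map<String, Integer> sorted = new LinkedHashMap<>(); // preserves insertion order
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .forEachOrdered(e -> sorted.put(e.getKey(), e.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortByValueDescending(Map.of("apple", 3, "banana", 4, "orange", 2)));
        // -> {banana=4, apple=3, orange=2}
    }
}
```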


5.2.3 Extraction Methods As mentioned above, the document fingerprint results from the remaining words of the original document. In other words, there must be a certain relationship between the document fingerprint and the remaining words. In this project, the relationship is regarded as a mathematical function:

Document Fingerprint = function(Remaining Words)

There are two important factors in the generation of the document fingerprint, according to the conclusion of the document analysis in section 4.2.1.2.

• Size. The number of high-frequency words used for creating the document fingerprint.
• Content. The property/occurrence requirement for the chosen words.

The fingerprint is briefly presented as a tuple consisting of Size and Content.

Document Fingerprint = {Size, Content}

By setting variables for these two factors, a draft of the document fingerprint selection standard is formed. The standard describes both the Size and the Content requirements.

A Document Fingerprint consists of no more than X highest-frequency words, each of which occurs more than Y times in the document.

• The variable X gives the planned size of the document fingerprint.
• The variable Y gives the minimum occurrence required for a word to be included in the document fingerprint.

The selection is performed in the following steps:
1. Read X entries from the sorted Map RemainingWords.
2. Obtain the Values of all the entries.
3. Compare the Values with Y:
   a. If and only if the Value is greater than Y, put the corresponding entry into the Map DocumentFingerprint.
   b. Otherwise, ignore the corresponding entry.
4. Return the Map DocumentFingerprint.


The selection process is presented in pseudo code in Figure 5.2.

Figure 5.2 The pseudo code of Document Fingerprint selection process
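Since Figure 5.2 is reproduced only as an image, a minimal Java sketch of the selection steps listed above is given here for readability; it assumes the Map is already sorted in descending order of Value, and X and Y denote the planned size and the minimal occurrence requirement:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the selection steps, assuming sortedRemainingWords is already
// sorted in descending order of Value (occurrence).
public class FingerprintSelectionSketch {

    static Map<String, Integer> selectFingerprint(Map<String, Integer> sortedRemainingWords,
                                                  int x, int y) {
        Map<String, Integer> documentFingerprint = new LinkedHashMap<>();
        int read = 0;
        for (Map.Entry<String, Integer> entry : sortedRemainingWords.entrySet()) {
            if (read++ >= x) {
                break; // step 1: read at most X entries from the top of the sorted Map
            }
            if (entry.getValue() > y) {
                // step 3a: the Value is greater than Y, keep the entry
                documentFingerprint.put(entry.getKey(), entry.getValue());
            }
            // step 3b: otherwise the entry is ignored
        }
        return documentFingerprint; // step 4
    }
}
```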

As a result, the document fingerprint selection standard is completely defined as follows.

In this study, the planned size X of the document fingerprint and the minimal occurrence requirement Y are defined as independent linear functions (f and g) of the quantity of the remaining words R:

X = f(R)
Y = g(R)

R is a non-negative integer, and both X and Y are expected to be non-negative integers as return values of the functions.

By substituting the applied linear functions into the relationships above and summarizing, the document fingerprint is described as a tuple of functions of the quantity of the remaining words in the document.

Document Fingerprint = {X, Y} = {f(R), g(R)}


5.2.4 Experimental Extraction Method The key part of the prototype performance analysis is to compare, analyze and evaluate the execution results of applying the document fingerprints. The resource consumption, response time and accuracy are the key factors taken into the analysis. Since the project aims to find the optimal solution to document search, different versions of the fingerprint extraction method are necessary.

Version Complete: Document Fingerprint satisfies {X, Y} = {R, 0} This version does not actually apply the selection standard, because all the remaining words are used for creating the document fingerprint. The planned size is the quantity of the remaining words, i.e. X = R, and no minimal occurrence requirement is set up, which means that Y = 0. The defined functions f and g are not used in this version.

Version Feature Full: Document Fingerprint satisfies {X, Y} = {f(R), g(R)} This is the standard feature version, which applies the selection standard described above. The linear function f(R) is the standard planned size of the document fingerprint, and g(R) decides the standard minimal occurrence requirement. This method produces a fingerprint that is much smaller than the Complete version.

Version Feature Half: Document Fingerprint satisfies {X, Y} = {f(R)/2, g(R)·2} This version is an adjusted form of Feature Full and also applies the selection standard. It halves the planned size but doubles the minimal occurrence requirement of the document fingerprint. The method is intended to generate a document fingerprint that is about half the length of the fingerprint produced by Feature Full.

Version Feature Quarter: Document Fingerprint satisfies {X, Y} = {f(R)/4, g(R)·4} This is also an adjusted form of the standard feature version and applies the selection standard. It quarters the planned size and quadruples the minimal occurrence requirement of the document fingerprint. The document fingerprint produced by this method is approximately a quarter of the length of that produced by Feature Full.

The last three versions are named Feature because they work as features extracted from the remaining words of the original document.


5.3 Prototype Environment Analysis In this section, the execution environment of the prototype is introduced and clarified.

5.3.1 Research Objective The project has two objectives to achieve as follows.

• To develop a prototype of the document-based search program, which returns the data entry that has the highest similarity to the input document.
• To find the optimal solution to the document search program.

5.3.2 Experiment Material Introduction The experimental subject is a set of PDF documents, which is provided by Loredge. The PDF documents are sorted into two folders. The files that have the same name in the folders are the same articles but in different formats, according to Loredge’s explanation.

• folder Author contains 100 articles named from 001 to 100
• folder Publish contains 100 articles named from 001 to 100

In this project, there are two different databases, a document database and a feature database, as shown in Figure 4.6. The folder Author itself works as the document database, and the documents within the folder are used to create the feature database. The PDF documents in the folder Publish are the input documents used as search statements against the feature database created from Author. Linear search is applied in this project in order to obtain all possible data for further research.

After the prototype of the search program is implemented, several document-reading experiments are performed. The results of these experiments show that there are several special documents in the two folders.

Folder Author

• Documents 008 and 024 are almost the same: 008 is a draft version while 024 is a formal version of the same article.
• Document 058 is partly unreadable by Apache PDFBox.


Folder Publish

• Documents 008 and 024 are identical.
• Document 077 is completely unreadable by Apache PDFBox.

These special documents have the following consequences:

• If the input document is either 008 or 024, both data entries 008 and 024 in the database Author are correct search results.
• Input document 077 has no correct answer in the database Author.
• Input document 058 has no correct answer in the database Author.

In order to achieve the first objective of the project, measuring the similarity between the documents, the Cosine Similarity introduced in section 4.2.3 is applied. The returned similarity is in the range from 0% to 100%, where 0% means that the documents are totally different and 100% means that they are identical.

For the second objective, the document fingerprint is proposed in section 4.2.1.2.3, and the related extraction method is introduced in section 5.2.3. Furthermore, four versions of the method are described in section 5.2.4. These methods are applied not only to process the input search statement but also to create the feature database (Figure 4.6). The feature database Author is a document folder that contains text files. Applying the document fingerprint aims to improve the search process, especially the document reading and the document comparison.

The performance analysis focuses on comparing and analyzing the program execution results generated when the different document fingerprint extraction methods (section 5.2.4) are applied, both to the input document and to the feature database. Thus, there are in total four scenarios to be analyzed, as shown in Figure 5.3 below.

Scenario Name   Applied Document Fingerprint Extraction Method
F-Complete      Version Complete
F-Full          Version Feature Full
F-Half          Version Feature Half
F-Quarter       Version Feature Quarter

Figure 5.3 Scenarios to be analyzed in this section


5.4 Accuracy Analysis The accuracy is definitely the most important factor that determines the performance of any type of search program, as users always want the correct search result, corresponding to the input search query.

In this project, the search program queries the feature database Author with an input PDF document from the folder Publish. All the correct search results are clearly pre-defined.

• Correct Search Result. The returned data entry from the feature database has the same name as the input PDF document.
• Wrong Search Result. The returned data entry from the feature database does not have the same name as the input PDF document.

Thus, the Accuracy Rate is calculated as follows:

$$\text{Accuracy Rate} = \frac{\text{Total Correct}}{\text{Total Correct} + \text{Total Wrong}} \times 100\%$$

Figure 5.5 is a multi-line chart that illustrates the correct and wrong answers under the four defined scenarios. The horizontal axis shows the file name of the input document, and the colored lines represent the respective defined scenarios. Each fluctuation of a line refers to a wrong answer, and the rest are correct answers. The details about the wrong answers (input document and wrong answer) are shown in Figure 5.6.


Figure 5.5 Fluctuation chart showing the correct/wrong answers under the four scenarios


Scenario     Input document   Output data entry
F-Complete   8                24
F-Complete   58               100
F-Complete   77               60
F-Full       8                24
F-Full       58               100
F-Full       77               88
F-Half       8                24
F-Half       16               82
F-Half       58               24
F-Half       77               88
F-Quarter    16               67
F-Quarter    24               8
F-Quarter    58               100
F-Quarter    77               88

Figure 5.6 Wrong search results under the four scenarios

Since there are several special documents in the folders, as introduced in section 5.3.2, the final accuracy rate must be calculated after considering the consequences of these documents. The accuracy rate of the prototype under the scenarios is shown in Figure 5.7.

Scenario Name   Accuracy Rate
F-Complete      100.00 %
F-Full          100.00 %
F-Half          99.00 %
F-Quarter       99.00 %

Figure 5.7 The accuracy rate under the scenarios in the final statistics

As can be seen from Figure 5.7, the accuracy is highest (100%) under both F-Complete and F-Full, while it is slightly lower (99%) under F-Half and F-Quarter. Nevertheless, all the applied document fingerprint extraction methods behave well with respect to accuracy. Taking the size of the fingerprints into the analysis, although F-Complete and F-Full have the same accuracy rate, the size of F-Complete is far larger than the other three. Furthermore, the size of F-Quarter is about half of F-Half, which in turn is almost half of F-Full, yet the gap among their accuracy rates is only 1%. In this project, the document fingerprint with the relatively best balance between size and accuracy is therefore F-Quarter, but the final answer depends on the fault tolerance that Loredge will decide in the future.


5.5 Response Time Analysis The response time is the time the search program takes to respond to a search request. It is one of the important factors that determine the performance of the designed document fingerprint extraction method and the developed program.

In this project, the whole folder Publish is used as a single search statement. In other words, the response time is measured from the moment the program starts to read the folder Publish, which contains 100 PDF documents, until the program returns the 100 data entries with the corresponding highest document similarities. Each response time measurement in this project is performed 50 times, and the average value is adopted as the final measured result to be shown and analyzed. Applying repeated measurements aims to reduce measurement error and leads to greater confidence in calculating an accurate average measurement [31]. The average value is used as the representation of the set of measured results, since it is more reliable and usually closest to the true value.
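As an illustration of this measurement procedure (a sketch only, with a dummy workload in place of the real search), the repeated timing and averaging could be implemented as follows:

```java
// Minimal sketch of how one response-time figure could be obtained: the same search
// run is repeated a number of times and the arithmetic mean of the elapsed times is reported.
// runSearch is a placeholder for one complete query of the folder Publish.
public class ResponseTimeSketch {

    static double averageResponseTimeSeconds(Runnable runSearch, int repetitions) {
        long totalNanos = 0L;
        for (int i = 0; i < repetitions; i++) {
            long start = System.nanoTime();
            runSearch.run();                       // one full search request
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) repetitions / 1.0e9; // mean elapsed time in seconds
    }

    public static void main(String[] args) {
        // Dummy workload instead of the real search, just to show the measurement pattern.
        double mean = averageResponseTimeSeconds(() -> Math.sqrt(Math.random()), 50);
        System.out.println("Average response time: " + mean + " s");
    }
}
```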

There are two main objectives of performing the response time analysis:

• Determining the performance of the applied document fingerprint.
• Proposing relevant response time improvement methods.

The whole response time analysis is performed from two perspectives: a global response time analysis and a local response time analysis.

Global Response Time Analysis (hereinafter referred to as the global analysis). In this analysis, the response time is measured and studied as a whole. It mainly studies the relationship between the response time and the size of the database. The global analysis is performed in two steps. First, a statistical model of the response time is developed; it is intended to estimate the response time for a considerably larger database. Then, by comparing the predicted response times under the four defined scenarios described in Figure 5.3, it is possible to determine the performance of the designed document fingerprint methods.

Local Response Time Analysis (hereinafter referred to as the local analysis). In contrast with the global analysis, this analysis focuses on studying the inner structure of the measured response time. The local analysis mainly aims to explore the possibilities of improving the response time. To achieve this objective, the whole measured response time is segmented into several separate partitions, so that more details can be observed and it becomes possible to determine whether each partition can be improved.


5.5.1 Global Response Time Analysis The Loredge database is designed to be large enough to store thousands or even millions of data entries, but the quantity of the given sample documents is quite limited. It is therefore important to simulate the response time the search program would take to work with a considerably larger database. The simulation is a prediction based on all the existing sample data, and the result of the prediction is used for determining whether the developed program performs well. This section contains three sequential phases, as shown in Figure 5.8: measuring the response time based on the existing data, predicting the response time according to the measured results, and performing comparisons among the obtained predicted response times.

Data Measurement → Prediction → Analysis Conclusion

Figure 5.8 The process of Global Response Time Analysis

5.5.1.1 Response Time Measurement Two things are varied in the response time measurement: the first variable is the size of the database Author, and the second is the defined scenario. According to the description of the experimental subject in section 5.3.2, the database contains 100 data entries, which are the document fingerprints generated from the original PDF documents in the folder Author. In other words, the maximal available database size in this research is 100. It is not enough to measure the response time only once; to be able to make predictions about the response time for a large database, more data points are required. Thus, the response time measurement is performed several times on the database Author with different sizes. The database size is adjusted to 10, 20, 30, ..., 100 data entries. For each database size, the sample data entries are randomly selected in order to reduce or even eliminate experimental sampling bias and noise [30]. In the performed experiment, each response time is measured under a circumstance that is a combination of one scenario and one database size, and the final response time for one circumstance is the mean value of the set of response times measured under that circumstance.


Based on the 10 different database sizes and the 4 scenarios, 40 average response times are obtained after the measurements. The average response times are shown in Figure 5.9. The horizontal axis refers to the database size, and the vertical axis shows the response time in seconds. The different colors of the data points refer to the different defined scenarios.


Figure 5.9 The variation of the response time with the size of the database. The response time is measured in seconds, and the size of the database refers to the quantity of data entries.

It is manifest from Figure 5.9 that the response time grows with the database size under all four scenarios: the greater the database size, the longer the response time. The gap between the response time under F-Complete and the other three scenarios widens considerably with the growth of the database size. It is difficult to distinguish the other three scenarios from each other, since the gaps among their data points are barely noticeable in most cases.


5.5.1.2 Response Time Prediction In order to be able to predict the response time, a statistical model that mathematically describes the generation of the data points in Figure 5.9 is required. The statistical model involves independent and dependent variables. The dependent variable is the one to be predicted, and the independent variable is the one used to predict the dependent variable. It is clear that the response time is the dependent variable and the database size is the independent variable in this project. Thus, the required statistical model is a mathematical function that describes the relationship between the dependent variable Response Time and the independent variable Database Size:

Response Time = function(Database Size)

Since the response time changes with the database size, the statistical model to be inferred is a trend estimation. The trend could simply be drawn by eye through the sets of data points in Figure 5.9, but in order to make the prediction more accurate and reliable, linear regression, a statistical technique, is applied. Since there is only one independent variable (Database Size) in the model, the applied trend estimation is a Simple Linear Regression, which is defined in section 2.4.1.2. With the simple linear regression, the data points are modeled as follows:

$$Y = \beta X + \varepsilon$$

This is a simple linear function that describes the relationship between the variables X and Y. Y is the response time, which is the outcome to be predicted, and X represents the database size, which is the known input variable. β and ε are two coefficients to be estimated: β is the slope of the line and ε is the vertical intercept. In order to fit the obtained data points to the model, the Linear Least Squares approach is applied. The approach finds the solution that minimizes the sum of squared differences between the data points and the fitted values of the model. The approach is simple but effective for this optimization problem, especially when the required statistical model is a linear function.
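As an illustration of the least squares fit (a sketch with hypothetical measurement values, not the project's measured data), the closed-form estimates of β and ε can be computed as follows:

```java
// Minimal sketch of fitting the simple linear model Y = βX + ε with the
// linear least squares approach, using the closed-form estimates of the
// slope β and the intercept ε.
public class LeastSquaresSketch {

    // Returns {beta, epsilon} for the best-fitting line through the points (x[i], y[i]).
    static double[] fitLine(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double covXY = 0.0, varX = 0.0;
        for (int i = 0; i < n; i++) {
            covXY += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
        }
        double beta = covXY / varX;            // slope
        double epsilon = meanY - beta * meanX; // vertical intercept
        return new double[]{beta, epsilon};
    }

    public static void main(String[] args) {
        // Hypothetical measurements: database sizes and response times in seconds.
        double[] size = {10, 20, 30, 40, 50};
        double[] time = {28.4, 31.0, 33.5, 36.1, 38.6};
        double[] fit = fitLine(size, time);
        System.out.println("y = " + fit[0] + "x + " + fit[1]);
    }
}
```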



Figure 5.10 The estimated trend lines under four defined scenarios. The estimation is based on the data points in the Figure 5.9. The applied approach is Linear Least Squares.

According to Figure 5.10, all the estimated trends of the response time grow linearly, as expected. F-Complete is the scenario with the longest response time, regardless of the database size. While the response time of F-Complete increases strongly, the other three have lower rates of increase. As the database size grows, the gaps among the three scenarios become wider and more noticeable; for the same database size, F-Full lies above F-Half, which in turn lies above F-Quarter. The estimated line functions that describe the relationship between the response time and the database size are given in Figure 5.11 below.

Estimated Trend Line   Linear Function (y = Response Time, x = Database Size)
F-Complete             y = 2.53E-01x + 2.59E+01
F-Full                 y = 1.21E-01x + 2.53E+01
F-Half                 y = 1.16E-01x + 2.52E+01
F-Quarter              y = 1.09E-01x + 2.54E+01

Figure 5.11 Statistical models of the trend line estimation of the response time in seconds

In order to show the response time for a larger database, the obtained trend lines in Figure 5.10 are extrapolated. Figure 5.12 below is a data table that predicts the response time for database sizes from 10 up to one trillion (10^12) data entries.

Predicted Response Time under the scenarios (seconds)
Database Size   F-Complete   F-Full     F-Half     F-Quarter
1.00E+01        2.84E+01     2.65E+01   2.63E+01   2.64E+01
1.00E+02        5.12E+01     3.74E+01   3.68E+01   3.63E+01
1.00E+03        2.79E+02     1.47E+02   1.41E+02   1.34E+02
1.00E+04        2.56E+03     1.24E+03   1.19E+03   1.12E+03
1.00E+05        2.53E+04     1.22E+04   1.16E+04   1.09E+04
1.00E+06        2.53E+05     1.21E+05   1.16E+05   1.09E+05
1.00E+07        2.53E+06     1.21E+06   1.16E+06   1.09E+06
1.00E+08        2.53E+07     1.21E+07   1.16E+07   1.09E+07
1.00E+09        2.53E+08     1.21E+08   1.16E+08   1.09E+08
1.00E+10        2.53E+09     1.21E+09   1.16E+09   1.09E+09
1.00E+11        2.53E+10     1.21E+10   1.16E+10   1.09E+10
1.00E+12        2.53E+11     1.21E+11   1.16E+11   1.09E+11

Figure 5.12 The predicted response times for the selected set of database sizes

5.5.1.3 Global Analysis Conclusion Combining the selected data in Figure 5.12 with the predicted linear functions in Figure 5.10 shows that as the database size grows, the contribution of the coefficient ε to the response time becomes negligible; in other words, the response time depends mainly on the coefficient β as the database size goes to infinity. The coefficients β of the linear functions are respectively 2.53E-01, 1.21E-01, 1.16E-01 and 1.09E-01. Thus, the ratio among the response times under the four scenarios is 100:48:46:43 for a very large database. A simple conclusion is drawn: the scenario F-Complete has the longest response time, around twice that of the other scenarios under any circumstance, while there are no such large gaps among the other three. All the document fingerprint methods in section 5.2.4 that apply the selection standard clearly and strongly reduce the response time, but a further reduction of the size of the document fingerprint provides only a marginal further reduction of the response time. For the same very large database size, from shortest to longest response time, the scenarios are sorted as follows:

F-Quarter (43) < F-Half (46) < F-Full (48) < F-Complete (100)


5.5.2 Local Response Time Analysis The local analysis is carried out after the global analysis. In contrast with the global analysis, the local one focuses on more specific and detailed aspects. The main purpose of the local response time analysis is to explore the possibility of improving the response time. To achieve this objective, a deeper observation and study of the obtained and predicted response times is required. The local analysis is performed in three stages, as shown in Figure 5.13 below.

Response Time Partition → Partition Analysis → Improvement Proposition

Figure 5.13 The process of Local Response Time analysis

First, the response time partition is performed: the whole search process is divided into several independent partitions, and the response time of each partition is measured. Afterwards, the partition response times are analyzed and evaluated. Both longitudinal and crosswise comparisons are carried out in order to gain a better understanding of each partition, and several problems are clarified by performing the partition analysis from several perspectives. Finally, relevant improvements for each partition are identified.

5.5.2.1 Response Time Partition This is the key part of the local analysis. The whole search program is broken up into four partitions according to the search process, which consists of reading a PDF document, pre-processing the document, querying the database and returning the result. Each partition has a clear task and all the partitions are independent. The response time of each partition is measured. The response time partition is shown in Figure 5.14.

Response Time: Reading-Time, Processing-Time, Match-Time, Result-Time

Figure 5.14 The response time partition


5.5.2.2 Partition Analysis In this section, all the partition times are analyzed and evaluated. The partition analysis focuses on clarifying the following problems:

• Which part of the response time varies with the database size?
• Which part of the response time varies with the applied document fingerprint?
• What causes the gaps between the measured response times?
• What is the importance and proportion of each partition in the whole search process?

The partitions are reading-time, processing-time, match-time and result-time.

Reading-Time The reading-time refers to the time the search program takes to read in the PDF document using the imported Apache PDFBox. It is measured from the moment the program starts to read the address of the PDF document until the program has extracted the content of the document. Since the reading process has nothing to do with either the database or the applied document fingerprint, a simple hypothesis is made: the reading-time is assumed to be a fixed value under any circumstance in this project. The partition analysis starts with measuring the reading-time and the whole response time.

Measured Reading-Time
              F-Complete              F-Full                  F-Half                  F-Quarter
database      Reading    %response    Reading    %response    Reading    %response    Reading    %response
10            2.30E+01   81.12%       2.32E+01   85.92%       2.29E+01   85.93%       2.27E+01   86.03%
20            2.30E+01   74.88%       2.29E+01   82.68%       2.28E+01   83.01%       2.26E+01   83.03%
30            2.27E+01   68.71%       2.26E+01   79.38%       2.26E+01   79.90%       2.26E+01   79.84%
40            2.30E+01   63.87%       2.27E+01   76.62%       2.27E+01   77.07%       2.28E+01   77.41%
50            2.31E+01   59.43%       2.32E+01   73.73%       2.31E+01   74.27%       2.31E+01   74.55%
60            2.28E+01   55.84%       2.29E+01   71.44%       2.31E+01   71.89%       2.34E+01   72.24%
70            2.37E+01   52.98%       2.38E+01   69.23%       2.37E+01   69.71%       2.38E+01   70.11%
80            2.25E+01   49.95%       2.28E+01   67.21%       2.27E+01   67.60%       2.27E+01   67.93%
90            2.31E+01   47.96%       2.35E+01   65.95%       2.32E+01   66.40%       2.25E+01   66.27%
100           2.32E+01   45.02%       2.36E+01   62.63%       2.38E+01   63.30%       2.34E+01   63.75%

Figure 5.15 The data table shows the measured reading-time in seconds and the percentage of the response time that the reading-time accounts for, per database size and scenario.

According to Figure 5.15, which shows the measured reading-time, all the reading-times fluctuate marginally and irregularly around 2.30E+01 seconds as the database size grows. Figure 5.16 below is a bar chart that graphically illustrates the reading-time from Figure 5.15 together with the response time from Figure 5.9.



Figure 5.16 The bar chart that compares the whole response time with the reading-time

According to Figure 5.16, the proposed hypothesis is confirmed: the reading-time in this project is essentially a fixed value, which varies with neither the database size nor the document fingerprint.

Figure 5.17 The percentage change of the reading-time's share of the whole response time, as the database size grows


Figure 5.17 above shows how the share of the whole response time constituted by the reading-time changes with the growth of the database size under the defined scenarios. The proportion of the reading-time decreases as the database size grows under all the scenarios; as the database size increases towards infinity, the reading-time will account for only a small proportion. A short conclusion is drawn: the reading-time is essentially a fixed value, because it depends only on the performance of Apache PDFBox, the file reading of Java and the rest of the program execution environment. The only ways to improve this partition time are to import other libraries to work with the PDF documents and to execute the program in a better system environment.

Processing-Time The processing-time refers to the time the program takes to pre-process the document and extract the document fingerprint with the defined method. This is the partition time measured after the reading process. The document pre-processing clearly does not depend on the database, since the process is performed before the program connects to the database. Thus, the processing-time is assumed to vary with the applied document fingerprint, not with the database size.

Measured Processing-Time
              F-Complete                F-Full                    F-Half                    F-Quarter
database      Processing   %response    Processing   %response    Processing   %response    Processing   %response
10            2.58E+00     9.11%        2.58E+00     9.57%        2.59E+00     9.75%        2.62E+00     9.93%
20            2.57E+00     8.34%        2.54E+00     9.18%        2.53E+00     9.21%        2.49E+00     9.13%
30            2.48E+00     7.52%        2.53E+00     8.89%        2.49E+00     8.80%        2.54E+00     8.94%
40            2.49E+00     6.91%        2.50E+00     8.44%        2.50E+00     8.48%        2.48E+00     8.39%
50            2.55E+00     6.54%        2.61E+00     8.29%        2.57E+00     8.26%        2.58E+00     8.33%
60            2.50E+00     6.12%        2.54E+00     7.92%        2.56E+00     7.99%        2.60E+00     8.02%
70            2.64E+00     5.92%        2.69E+00     7.82%        2.70E+00     7.95%        2.70E+00     7.94%
80            2.45E+00     5.43%        2.50E+00     7.38%        2.50E+00     7.42%        2.50E+00     7.47%
90            2.51E+00     5.21%        2.59E+00     7.26%        2.57E+00     7.36%        2.47E+00     7.29%
100           2.55E+00     4.95%        2.65E+00     7.04%        2.63E+00     7.01%        2.61E+00     7.09%

Figure 5.18 The data table shows the measured processing-time in seconds and the percentage of the response time that the processing-time accounts for, per database size and scenario.

According to Figure 5.18, the processing-time varies within a narrow range around 2.50E+00 seconds, and the fluctuation shows no clear regularity. Thus, this partition time does not depend on the database. F-Complete has the shortest processing-time, followed by F-Quarter, F-Half and F-Full in second, third and fourth place respectively.


[Bar chart: Comparison (Response Time, Processing-Time). Y-axis: time in seconds (0 to 6.00E+01); X-axis: database sizes 10–100 under F-Complete, F-Full, F-Half and F-Quarter.]

Figure 5.19 The bar chart that compares the whole response time with the processing-time

Figure 5.19 above plots the measured processing-time from Figure 5.18 and the response time from Figure 5.9 in the same bar chart. Unlike the reading-time shown in Figure 5.16, the processing-time accounts for a very low proportion of the whole response time. Neither the fluctuation of the processing-time nor the processing-time itself has a significant effect on the whole response time. Analogously to the reading-time, as the database size grows towards infinity, the proportion of the processing-time approaches zero. In summary, the assumption is partly verified: the processing-time in this project is a fixed value. It varies only marginally with the applied document fingerprint extraction method, and the variation is so small and irregular that it can be ignored. Thus, this partition is not the factor that causes the gaps among the measured response times under the scenarios. The scope for improving this partition time is quite limited in this project. Running the search program in a better system environment (both hardware and software) would certainly shorten the processing-time. The other option is to apply a higher-quality fingerprint extraction method, but the relationship between the extraction method and the processing-time remains to be studied and analyzed.


Match-Time
The match-time is the time the program takes to query the database. This partition time is measured from the moment the program starts to read the first data entry from the local database until it has generated a list that contains the measured document similarities between the input document and all data entries in the database.

Measured Match-Time (seconds) and its share of the response time

Database   F-Complete             F-Full                 F-Half                 F-Quarter
size       Match       %resp.     Match       %resp.     Match       %resp.     Match       %resp.
10         2.77E+00    9.77%      1.21E+00    4.50%      1.15E+00    4.32%      1.07E+00    4.03%
20         5.16E+00    16.77%     2.25E+00    8.13%      2.14E+00    7.78%      2.13E+00    7.83%
30         7.84E+00    23.76%     3.34E+00    11.72%     3.20E+00    11.30%     3.18E+00    11.21%
40         1.05E+01    29.21%     4.43E+00    14.93%     4.26E+00    14.44%     4.19E+00    14.19%
50         1.32E+01    34.02%     5.66E+00    17.98%     5.43E+00    17.46%     5.31E+00    17.11%
60         1.56E+01    38.04%     6.61E+00    20.62%     6.45E+00    20.11%     6.40E+00    19.73%
70         1.84E+01    41.10%     7.90E+00    22.94%     7.58E+00    22.33%     7.45E+00    21.94%
80         2.01E+01    44.61%     8.60E+00    25.40%     8.40E+00    24.97%     8.23E+00    24.59%
90         2.25E+01    46.81%     9.56E+00    26.78%     9.17E+00    26.24%     8.97E+00    26.43%
100        2.58E+01    50.03%     1.14E+01    30.32%     1.12E+01    29.68%     1.07E+01    29.15%

Figure 5.20 The data table shows the measured match-time in seconds and the percentage of the response time that the match-time occupies, per database size under the four scenarios.

[Bar chart: Comparison (Response Time, Match-Time). Y-axis: time in seconds (0 to 6.00E+01); X-axis: database sizes 10–100 under F-Complete, F-Full, F-Half and F-Quarter.]

Figure 5.21 The bar chart that compares the whole response time with the match-time


According to Figure 5.20, the match-time clearly increases with the growth of the database size. For the same database size, the F-Complete has the longest match-time, roughly twice as long as under the other scenarios, while the gaps among the other three are relatively small. The match-time ranking obtained from Figure 5.20 is F-Complete > F-Full > F-Half > F-Quarter. It is evident from Figure 5.21 that the growth rates of the response time and the match-time are quite similar. A deeper and more detailed study of the relationship between the response time, the match-time and the database size is therefore performed.

The match-time is assumed to increase linearly under all scenarios. Hence, the linear function that describes the relationship between the database size and the match-time is:

MatchTime = β · DatabaseSize + ε

After substituting the data from Figure 5.20 into the model and performing the linear regression, the estimated linear functions are obtained as shown in Figure 5.22.

Estimated linear functions (M = Match-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    M = 2.52E-01x + 3.26E-01
F-Full        M = 1.10E-01x + 7.56E-02
F-Half        M = 1.07E-01x + 1.83E-02
F-Quarter     M = 1.03E-01x + 7.11E-02

Figure 5.22 Estimated functions between the match-time in seconds and the database size.
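The regression itself is ordinary least squares over ten (database size, match-time) pairs per scenario. The sketch below is a minimal, self-contained way to reproduce such an estimate in Java; the class and method names are illustrative and not taken from the prototype, and the example data are the F-Quarter match-times from Figure 5.20.

```java
// Ordinary least-squares fit of y = beta * x + epsilon.
public final class LinearTrend {

    /** Returns {beta, epsilon} for the best-fitting line through (x[i], y[i]). */
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        double beta = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double epsilon = (sumY - beta * sumX) / n;   // intercept
        return new double[] { beta, epsilon };
    }

    public static void main(String[] args) {
        // Match-times (seconds) for F-Quarter from Figure 5.20, database sizes 10..100.
        double[] size = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        double[] time = {1.07, 2.13, 3.18, 4.19, 5.31, 6.40, 7.45, 8.23, 8.97, 10.7};
        double[] fit = fit(size, time);
        System.out.printf("M = %.3e * x + %.3e%n", fit[0], fit[1]);  // close to Figure 5.22
    }
}
```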

The next step is to compare the slopes of the estimated match-time functions with the slopes of the estimated response time functions shown in Figure 5.9.

             Slope (Response Time)   Slope (Match-Time)   Slope Ratio
F-Complete   2.53E-01                2.52E-01             100 : 99.65
F-Full       1.21E-01                1.10E-01             100 : 90.21
F-Half       1.16E-01                1.07E-01             100 : 92.05
F-Quarter    1.09E-01                1.03E-01             100 : 94.95

Figure 5.23 Comparing the slopes of the estimated response time and match-time functions


The result of the comparison is shown in Figure 5.23 above. As the database size approaches infinity, the ratio between the response time and the match-time approaches the ratio between the slopes of the corresponding functions. The Slope Ratio is calculated as slope(Response Time) : slope(Match-Time), and its second term therefore also indicates the percentage of the whole response time that the match-time constitutes for a very large database under each scenario.

A general conclusion is drawn. The match-time varies with both the database size and the applied document fingerprint, and the variation is mainly driven by the database size. As the database size increases, the percentage of the whole response time constituted by the match-time approaches more than 90% under every scenario. Thus, the match-time is by far the most important partition and the one that most strongly affects the response time.

To be able to propose an improvement method for this partition, a further partition of the match-time is required. This further partition is called the “second partition” and is based on the set of phases in the match process shown in Figure 5.24. In order to distinguish the first partition from the second partition, a small “m” is put before the second partition names.

Match-Time

mReading-Time mProcessing-Time mCalculation-Time mResult-Time

Figure 5.24 The further partition on the match-time.

mReading-Time
The mReading-time refers to the time the program takes to read the data entries in the local database during the match process. This partition time is measured from the moment the program starts to read the address of the first data entry until it has finished reading all the data entries into memory. The assumption about this partition is that a larger database causes a longer mReading-time, since there are more data entries to be read. Likewise, the more information a data entry contains, the longer the mReading-time becomes.


Measured mReading-Time (seconds) and its share of the match-time

Database   F-Complete             F-Full                 F-Half                 F-Quarter
size       mReading    %match     mReading    %match     mReading    %match     mReading    %match
10         2.12E+00    76.47%     1.19E+00    98.21%     1.14E+00    99.17%     1.06E+00    99.57%
20         3.91E+00    75.66%     2.21E+00    98.18%     2.12E+00    99.16%     2.12E+00    99.60%
30         5.98E+00    76.28%     3.28E+00    98.17%     3.17E+00    99.18%     3.17E+00    99.63%
40         8.03E+00    76.25%     4.35E+00    98.16%     4.22E+00    99.20%     4.17E+00    99.63%
50         1.01E+01    76.17%     5.55E+00    98.15%     5.39E+00    99.18%     5.29E+00    99.62%
60         1.19E+01    76.56%     6.49E+00    98.16%     6.40E+00    99.19%     6.37E+00    99.62%
70         1.41E+01    76.50%     7.75E+00    98.19%     7.52E+00    99.22%     7.43E+00    99.64%
80         1.55E+01    77.11%     8.45E+00    98.23%     8.34E+00    99.24%     8.20E+00    99.66%
90         1.72E+01    76.58%     9.37E+00    98.05%     9.09E+00    99.22%     8.94E+00    99.67%
100        1.99E+01    77.13%     1.12E+01    98.22%     1.11E+01    99.23%     1.07E+01    99.67%

Figure 5.25 The data table shows the measured mReading-time in seconds and the percentage of the match-time that the mReading-time occupies, per database size under the scenarios.

[Bar chart: Comparison (Match-Time, mReading-Time). Y-axis: time in seconds (0 to 3.00E+01); X-axis: database sizes 10–100 under F-Complete, F-Full, F-Half and F-Quarter.]

Figure 5.26 The bar chart that graphically presents and compares the measured match-time and mReading-time shown in Figure 5.25


According to Figure 5.25, the mReading-time increases substantially with the growth of the database. The mReading-time of the F-Complete is far longer than its counterparts in the other three scenarios for every database size, while the gaps among the other three scenarios are relatively small. As can be seen from Figure 5.26, the measured mReading-time always accounts for a large proportion of the match-time. A linear trend estimation of the mReading-time as a function of the database size is made for each scenario; the estimated functions are shown in Figure 5.27.

Estimated linear functions (Mr = mReading-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    Mr = 1.94E-01x + 1.76E-01
F-Full        Mr = 1.08E-01x + 7.42E-02
F-Half        Mr = 1.06E-01x + 1.62E-02
F-Quarter     Mr = 1.03E-01x + 6.86E-02

Figure 5.27 Trend functions between the mReading-time in seconds and the database size.

             Slope (Match)   Slope (mReading)   Slope Ratio
F-Complete   2.52E-01        1.94E-01           100 : 77.16
F-Full       1.10E-01        1.08E-01           100 : 98.17
F-Half       1.07E-01        1.06E-01           100 : 99.25
F-Quarter    1.03E-01        1.03E-01           100 : 99.68

Figure 5.28 Comparison of the slopes of the mReading-time and match-time functions.

According to the slope ratios between the match-time and mReading-time functions shown in Figure 5.28 above, it follows that as the database size increases towards infinity, the mReading-time under the scenarios occupies respectively 77.16%, 98.17%, 99.25% and 99.68% of the match-time. Summing up, the mReading-time is by far the most important partition and the one with the greatest effect on the match-time. The F-Quarter is the document fingerprint with the best performance in this partition. The factors that decide the performance of the mReading process are the file-reading technique, the database size and the system environment, which leaves little room for improvement within the program itself. A document fingerprint extraction method that generates a smaller fingerprint gives a shorter mReading-time. A more effective way of reading local files in Java would also help to reduce the time the program takes to read a file, but that is a subject for future work. Alternatively, applying an efficient database system is always an effective way to reduce the mReading-time.
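For illustration, the mReading step amounts to loading every fingerprint file of the local feature database into memory. The sketch below shows one way to do this with the standard Java file API; the folder path, the "*.txt" layout and the class name are assumptions for this example, not the prototype's actual code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Reads every fingerprint file in a local feature-database folder into memory.
public final class FeatureDatabaseReader {

    public static Map<String, String> readAll(Path featureDir) throws IOException {
        Map<String, String> fingerprints = new LinkedHashMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(featureDir, "*.txt")) {
            for (Path entry : entries) {
                // File name identifies the data entry; content is its fingerprint text.
                String content = new String(Files.readAllBytes(entry), StandardCharsets.UTF_8);
                fingerprints.put(entry.getFileName().toString(), content);
            }
        }
        return fingerprints;
    }
}
```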


mProcessing-Time
The mProcessing process is carried out after the mReading process. This partition time is the time the program takes to process the input document fingerprint and the database fingerprint (the read data entry) so that the similarity can be measured.

Measured mProcessing-Time (seconds) and its share of the match-time

Database   F-Complete               F-Full                   F-Half                   F-Quarter
size       mProcessing  %match      mProcessing  %match      mProcessing  %match      mProcessing  %match
10         6.39E-01     23.10%      1.93E-02     1.59%       7.58E-03     0.66%       2.69E-03     0.25%
20         1.23E+00     23.89%      3.68E-02     1.63%       1.43E-02     0.67%       5.15E-03     0.24%
30         1.83E+00     23.29%      5.50E-02     1.65%       2.11E-02     0.66%       7.33E-03     0.23%
40         2.45E+00     23.31%      7.35E-02     1.66%       2.76E-02     0.65%       9.51E-03     0.23%
50         3.10E+00     23.37%      9.45E-02     1.67%       3.57E-02     0.66%       1.24E-02     0.23%
60         3.58E+00     22.99%      1.10E-01     1.66%       4.25E-02     0.66%       1.51E-02     0.24%
70         4.23E+00     23.05%      1.30E-01     1.65%       4.84E-02     0.64%       1.66E-02     0.22%
80         4.52E+00     22.45%      1.39E-01     1.62%       5.25E-02     0.63%       1.83E-02     0.22%
90         5.17E+00     22.98%      1.72E-01     1.80%       5.98E-02     0.65%       1.92E-02     0.21%
100        5.79E+00     22.43%      1.86E-01     1.63%       7.12E-02     0.64%       2.32E-02     0.22%

Figure 5.29 The data table shows the measured mProcessing-time in seconds and the percentage of the match-time that the mProcessing-time occupies, per database size under the scenarios.

According to Figure 5.29, the mProcessing-time clearly increases strongly with both the database size and the size of the document fingerprint. In contrast, the proportion of the mProcessing-time shows only a very indistinct fluctuation. In the same manner as before, a linear trend estimation of the mProcessing-time is performed to determine the importance of this partition for a very large database.

Estimated linear functions (Mp = mProcessing-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    Mp = 5.65E-02x + 1.47E-01
F-Full        Mp = 1.85E-03x + 1.40E-05
F-Half        Mp = 6.77E-04x + 8.19E-04
F-Quarter     Mp = 2.19E-04x + 8.86E-04

Figure 5.30 Trend functions between the mProcessing-time in seconds and the database size.

Taking the slopes of the estimated linear functions that describe the relationship between the database size and the mProcessing-time in Figure 5.30 above, and comparing them with the slopes of the estimated match-time functions, gives the result shown in Figure 5.31 below.


             Slope (Match)   Slope (mProcessing)   Slope Ratio
F-Complete   2.52E-01        5.65E-02              100 : 22.40
F-Full       1.10E-01        1.85E-03              100 : 1.69
F-Half       1.07E-01        6.77E-04              100 : 0.63
F-Quarter    1.03E-01        2.19E-04              100 : 0.21

Figure 5.31 Comparing the slopes of the match-time and mProcessing-time functions

According to Figure 5.31, as the database size approaches infinity, the mProcessing-time occupies respectively 22.40%, 1.69%, 0.63% and 0.21% of the match-time under the scenarios. Except for the F-Complete, the proportions of the mProcessing-time under the other three scenarios are very small in comparison with the mReading-time. The F-Complete obviously has the longest mProcessing-time, because its document fingerprint contains the most high-frequency words to be processed. It is clear that the shortest document fingerprint yields the shortest mProcessing-time.

mCalculation-Time
This partition time is the time the program takes to calculate the document similarity between the two vectors by applying the Cosine Similarity. The result of the mCalculation-time measurement is shown in Figure 5.32 below.

Measured mCalculation-Time (seconds) and its share of the match-time

Database   F-Complete                F-Full                    F-Half                    F-Quarter
size       mCalculation  %match      mCalculation  %match      mCalculation  %match      mCalculation  %match
10         9.94E-03      0.36%       9.33E-04      0.08%       6.47E-04      0.06%       5.26E-04      0.05%
20         1.98E-02      0.38%       1.73E-03      0.08%       1.22E-03      0.06%       9.90E-04      0.05%
30         2.94E-02      0.37%       2.53E-03      0.08%       1.69E-03      0.05%       1.29E-03      0.04%
40         3.97E-02      0.38%       3.31E-03      0.07%       2.18E-03      0.05%       1.71E-03      0.04%
50         5.00E-02      0.38%       4.37E-03      0.08%       2.87E-03      0.05%       2.25E-03      0.04%
60         5.89E-02      0.38%       4.72E-03      0.07%       3.19E-03      0.05%       2.49E-03      0.04%
70         7.00E-02      0.38%       5.59E-03      0.07%       3.48E-03      0.05%       2.61E-03      0.04%
80         7.60E-02      0.38%       5.89E-03      0.07%       3.71E-03      0.04%       2.72E-03      0.03%
90         8.73E-02      0.39%       6.66E-03      0.07%       3.77E-03      0.04%       2.56E-03      0.03%
100        9.72E-02      0.38%       7.30E-03      0.06%       4.60E-03      0.04%       3.12E-03      0.03%

Figure 5.32 The data table shows the measured mCalculation-time in seconds and the percentage of the match-time that the mCalculation-time occupies, per database size under the scenarios.


According to Figure 5.32 above, the mCalculation-time is proportional to the database size, but it occupies only a very small part of the match-time in every case. The fluctuation of its percentage, in contrast, decreases as the database size grows. The F-Complete has the longest mCalculation-time, and there is no large gap among the other three. In order to gain a better understanding of the relationship between the mCalculation-time and the match-time, a trend estimation is carried out. The estimated functions that describe the relationship between the mCalculation-time and the database size are shown in Figure 5.33, and the slope ratio between the mCalculation-time and the match-time is shown in Figure 5.34.

Estimated linear functions (Mc = mCalculation-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    Mc = 9.65E-04x + 7.70E-04
F-Full        Mc = 7.02E-05x + 4.45E-04
F-Half        Mc = 4.11E-05x + 4.78E-04
F-Quarter     Mc = 2.69E-05x + 5.47E-04

Figure 5.33 Trend functions between the mCalculation-time in seconds and the database size.

             Slope (Match)   Slope (mCalculation)   Slope Ratio
F-Complete   2.52E-01        9.65E-04               100 : 0.38
F-Full       1.10E-01        7.02E-05               100 : 0.06
F-Half       1.07E-01        4.11E-05               100 : 0.04
F-Quarter    1.03E-01        2.69E-05               100 : 0.03

Figure 5.34 Comparing the slopes of the match-time and mCalculation-time functions

The shorter document fingerprint gives the better performance in this partition, but there are two reasons why further improvement is unnecessary and hardly even possible. Firstly, analogously to the mProcessing-time, the mCalculation-time takes up a smaller proportion as the database grows, and that proportion becomes small enough to be ignored, as shown in Figure 5.34. More importantly, the similarity calculation applies the Cosine Similarity, whose expression is already completely defined, so there is no effective way to control or improve the calculation itself. The performance of this partition is entirely determined by the programming language and the system environment.


mResult-Time
The mResult process is the last partition of the match process. It refers to the time the program takes to put all the measured document similarities into a Java Map.
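To make the step concrete, the sketch below collects similarities into a Map and then picks the entry with the highest similarity (the later result step). The class name, method name and example entries are purely illustrative and not taken from the prototype.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Stores measured similarities in a Map and selects the best-matching entry.
public final class SimilarityResult {

    public static Map.Entry<String, Double> bestMatch(Map<String, Double> similarities) {
        // The entry with the highest similarity value is the returned answer.
        return Collections.max(similarities.entrySet(), Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        Map<String, Double> similarities = new HashMap<>();
        similarities.put("document_16.txt", 0.81);   // hypothetical entries
        similarities.put("document_55.txt", 0.61);
        similarities.put("document_95.txt", 0.93);
        System.out.println(bestMatch(similarities)); // document_95.txt=0.93
    }
}
```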

Measured mResult-Time (seconds) and its share of the match-time

Database   F-Complete            F-Full                F-Half                F-Quarter
size       mResult     %match    mResult     %match    mResult     %match    mResult     %match
10         8.44E-04    0.03%     5.97E-04    0.05%     5.68E-04    0.05%     5.51E-04    0.05%
20         1.61E-03    0.03%     1.07E-03    0.05%     1.07E-03    0.05%     9.96E-04    0.05%
30         2.21E-03    0.03%     1.49E-03    0.04%     1.43E-03    0.04%     1.41E-03    0.04%
40         3.71E-03    0.04%     2.25E-03    0.05%     2.18E-03    0.05%     2.11E-03    0.05%
50         4.84E-03    0.04%     2.77E-03    0.05%     2.69E-03    0.05%     2.59E-03    0.05%
60         5.83E-03    0.04%     3.30E-03    0.05%     3.26E-03    0.05%     3.31E-03    0.05%
70         6.72E-03    0.04%     3.90E-03    0.05%     3.78E-03    0.05%     3.90E-03    0.05%
80         6.59E-03    0.03%     3.79E-03    0.04%     3.93E-03    0.05%     3.80E-03    0.05%
90         6.43E-03    0.03%     4.85E-03    0.05%     4.46E-03    0.05%     4.21E-03    0.05%
100        9.53E-03    0.04%     5.61E-03    0.05%     5.68E-03    0.05%     4.99E-03    0.05%

Figure 5.35 The data table shows the measured mResult-time in seconds and the percentage of the match-time that the mResult-time occupies, per database size under the scenarios.

According to Figure 5.35, the main trend under the four scenarios is that the mResult-time grows with the database size. However, the percentage of the match-time that the mResult-time constitutes fluctuates within a very small interval in all circumstances. Thus, the mResult-time is assumed not to account for a larger proportion of the match-time as the database size increases.

Estimated linear functions (Me = mResult-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    Me = 8.72E-05x + 3.51E-05
F-Full        Me = 5.37E-05x + 1.14E-05
F-Half        Me = 5.30E-05x - 1.18E-05
F-Quarter     Me = 4.88E-05x + 1.04E-04

Figure 5.36 Trend functions for the relationship between the mResult-time and the database size.

The estimated trend functions in Figure 5.36 verify the assumption: the slopes (growth rates) of the functions are small in comparison with the other partitions, which means that improvement of this partition can be ignored.


Result-Time
In this process, the program sorts the obtained Map of measured document similarities and returns the entry with the highest document similarity as the response to the search query. The result of the measurement is recorded in Figure 5.37 below.

Measured Result-Time (seconds) and its share of the response time

Database   F-Complete            F-Full                F-Half                F-Quarter
size       Result      %resp.    Result      %resp.    Result      %resp.    Result      %resp.
10         1.15E-03    0.00%     7.24E-04    0.00%     7.08E-04    0.00%     6.78E-04    0.00%
20         1.36E-03    0.00%     9.52E-04    0.00%     9.26E-04    0.00%     8.60E-04    0.00%
30         1.82E-03    0.01%     1.34E-03    0.00%     1.29E-03    0.00%     1.21E-03    0.00%
40         2.46E-03    0.01%     1.92E-03    0.01%     1.84E-03    0.01%     1.79E-03    0.01%
50         2.74E-03    0.01%     2.34E-03    0.01%     2.33E-03    0.01%     2.19E-03    0.01%
60         2.94E-03    0.01%     2.49E-03    0.01%     2.47E-03    0.01%     2.44E-03    0.01%
70         3.33E-03    0.01%     2.88E-03    0.01%     2.87E-03    0.01%     2.82E-03    0.01%
80         3.27E-03    0.01%     2.83E-03    0.01%     2.77E-03    0.01%     2.69E-03    0.01%
90         3.41E-03    0.01%     3.02E-03    0.01%     2.93E-03    0.01%     2.64E-03    0.01%
100        3.70E-03    0.01%     3.55E-03    0.01%     3.47E-03    0.01%     3.43E-03    0.01%

Figure 5.37 The data table shows the measured result-time in seconds and the percentage of the response time that the result-time occupies, per database size under the four scenarios.

As can be seen from Figure 5.37 above, both the value and the proportion of the result-time increase as the database size grows. A trend line estimation of the result-time is performed in the same manner, applying the linear statistical model.

Estimated linear functions (R = Result-Time in seconds, x = Database Size)

Scenario      Linear Function
F-Complete    R = 2.87E-05x + 1.04E-03
F-Full        R = 3.05E-05x + 5.26E-04
F-Half        R = 3.00E-05x + 5.09E-04
F-Quarter     R = 2.90E-05x + 4.78E-04

Figure 5.38 Trend functions for the relationship between the result-time and the database size.

The slopes of the functions shown in Figure 5.38 show that the gaps among the four scenarios become smaller as the database grows towards infinity; in other words, no scenario performs clearly better than the other three. On the other hand, the result-time accounts for an even smaller proportion than all the other partition times analyzed in this section. Hence, improving this partition is also less effective and unnecessary.


5.5.2.3 Local Analysis Summarization
After performing the response time partition and the related partition analyses, an in-depth understanding of the response time is obtained. According to the results of the partition analyses, all partitions are sorted into two categories, fixed partition time and flexible partition time, as shown in Figure 5.39.

Fixed Partition Time:
• reading-time
• processing-time

Flexible Partition Time:
• match-time
• mReading-time
• mProcessing-time
• mCalculation-time
• mResult-time
• result-time

Figure 5.39 Categories of the partition times: database-independent (fixed) and database-dependent (flexible).

All the flexible partition times increase with the growth of the database size, while there is no correlation between the fixed partition time and the database.

The reading-time is an absolutely fixed partition time, which means that it varies with neither the database size nor the applied document fingerprint extraction method. The average reading-time is 2.30E+01 seconds. Only a few factors affect the reading-time: Apache PDFBox is the library imported in this project to work with the PDF documents, and Java is the only programming language. The only possible improvement of this partition is to apply another library for working with PDF documents and extracting their content.

The processing-time is a fixed partition time. The average processing-times under the scenarios are respectively 2.53E+00, 2.57E+00, 2.57E+00 and 2.56E+00 seconds. The performance of this partition depends on the applied document fingerprint extraction method, and the gaps among the average values are quite small. The detailed relationship between the processing-time and the document fingerprint needs to be studied further in order to propose an improvement for this partition time.


The match-time is definitely the most important partition of the whole response time. As shown in Figure 5.23, as the database size approaches infinity, the match-time accounts for more than 90% of the whole response time under every scenario. After making the secondary partition and the related analyses, several more detailed insights into the match-time are gained.

1. The greatest partition of the match-time is definitely the mReading-time, the time the program takes to read the data entries in the database. As the database size increases towards infinity, the proportion of the response time that the mReading-time occupies under each scenario is respectively 76.20%, 88.56%, 91.31% and 94.60%, according to Figure 5.23 and Figure 5.28.

2. Even though all the other secondary partition times (mProcessing, mCalculation and mResult) vary directly with the database size, their proportions do not increase with the database size. Furthermore, they are small enough to be ignored in comparison with the mReading-time.

Based on these insights, the improvement effort should focus on the following areas.

Applied Database. In this project, local document folders are used as the database, since they are simple to modify and maintain. But there is no guarantee that this kind of database is also efficient. Thus, applying a mature, high-quality database system is very likely to reduce the time the program takes to communicate with the database. The applied database can be either a database server or a database management system.

Searching Algorithm. This project applies a linear search algorithm with linear time complexity O(n). There are two reasons why the linear search is applied: firstly, the local database is unsorted; furthermore, all document similarities between the input file and the data entries are wanted for further research. The advantage of the linear search is its low memory consumption and error rate, but its biggest disadvantage is its time complexity, which is higher than that of many other search algorithms. Applying a search algorithm with better time complexity would clearly reduce the database working time, but the feature database must then be sorted in some way.

The result-time is also a flexible partition time that varies directly with the database size. By the same token, both the value and the proportion of the result-time are small enough (as shown in Figure 5.37) to be ignored in every case, even though both of them grow with the database. Furthermore, the performance of this partition is mainly determined by the database size and the system environment. Hence, a relevant improvement of this partition is practically impossible and unnecessary.


5.6 Memory Consumption Analysis
As described in Figure 4.6, the server-side database (the Loredge database) consists of a document database (hereinafter referred to as the OPDF-database) that stores the original PDF documents, and a feature database that contains the document fingerprints extracted from the documents in the OPDF-database. In this section, the memory consumption of the database is analyzed. The analysis aims to give a better understanding of the designed extraction method, determining its performance and proposing possible improvements. The size of the OPDF-database is a fixed constant, namely the size of the document folder Author provided by Loredge, whereas the cost of the feature database is determined by the document fingerprint. In the search for an optimal solution, four kinds of document fingerprints are applied in the project. The F-Complete uses all the remaining words extracted from the original document to create the document fingerprint, while the F-Full, F-Half and F-Quarter apply the document fingerprint selection standard (section 5.2.3) to the remaining words to generate shorter but more representative document fingerprints. As their names suggest, the F-Quarter is designed to be half the length of the F-Half, which in turn is half the length of the F-Full, and the F-Complete is far larger than all three.

The memory measurement is carried out by measuring the sizes of the feature databases for the folder Author under the defined document fingerprint extraction methods (F-Complete, F-Full, F-Half and F-Quarter) and the size of the OPDF-database. The measurement unit of the memory consumption is the byte, a unit of digital information consisting of eight bits.
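How the folder sizes were measured is not specified in the thesis; the sketch below is one way such a measurement could be reproduced in Java. The folder path in the example is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sums the sizes (in bytes) of all regular files under a database folder.
public final class FolderSize {

    public static long sizeInBytes(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // "feature-db/F-Quarter" is a hypothetical folder name, not the thesis's actual path.
        System.out.println(sizeInBytes(Paths.get("feature-db/F-Quarter")) + " bytes");
    }
}
```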

Memory Cost Measurement

Database Name            Size in bytes
Original PDF Document    257,927,992
F-Complete               1,609,340
F-Full                   89,040
F-Half                   39,602
F-Quarter                14,707

Figure 5.70 Result of the database size memory measurement in bytes

The result of the memory measurement is presented in Figure 5.70 above. The measured sizes of the databases are respectively 257,927,992 bytes, 1,609,340 bytes, 89,040 bytes, 39,602 bytes and 14,707 bytes. The size of the OPDF-database is obviously far larger than the sizes of all the feature databases.


Two factors cause this large gap. The first and most important factor is the difference in file format. The files in the OPDF-database are in PDF format, which is much more complex than the plain text format used in the feature databases. Even when a PDF file and a text file contain exactly the same content, their sizes differ greatly; for example, a file containing a single character is 145,556 bytes as a PDF but 1 byte as a text file. The second factor is the difference in content. The document fingerprint is plain text consisting of words extracted from the original document after the pre-processing, while the PDF file contains the complete document, which may include other elements besides plain text, e.g. figures and multimedia.

The size of the Loredge database is determined by one of the feature databases together with the OPDF-database. According to Figure 5.70, the cost of the feature databases is tiny compared with the cost of the OPDF-database. More specifically, applying the F-Complete, F-Full, F-Half or F-Quarter feature database expands the total size by 0.6239%, 0.0345%, 0.0154% and 0.0057% respectively. The feature databases, especially those that apply the document fingerprint selection standard, obviously do not impose a large extra burden. The result also conforms to the design requirements (section 5.2.3) of the document fingerprint extraction method. The ratio of the memory consumption between the F-Complete, F-Full, F-Half and F-Quarter is 1 : 0.0553 : 0.0246 : 0.0091. It shows that the F-Complete is far larger than the other three, and that the F-Full is more than double the F-Half, which in turn is more than double the F-Quarter.

To sum up, the memory consumption of any of the feature databases is very small compared with the OPDF-database. Thus, applying the document fingerprint is an acceptable improvement method. Among the feature databases, the shorter the document fingerprint is, the less memory/space the feature database needs. In this project, the F-Quarter has the best performance in memory consumption, since its extraction method aims to generate the shortest document fingerprint that still represents the original document.


5.7 Similarity Analysis
In this section, all the obtained document similarities are analyzed in order to gain a better understanding of the developed prototype. The analysis aims to provide a solid foundation for further research on performance improvement.

5.7.1 Problem Statement
As defined in the project, the returned data entry needs to satisfy only one requirement: having the highest document similarity to the input PDF document. The catch is that the highest similarity is not necessarily the most suitable, or even a correct, answer. For example, a poor 1% similarity may be the highest one and therefore be returned as the final answer, simply because there is no better alternative.

According to the conclusion drawn from the response time analysis in section 5.5, the mReading-time taken to access the feature database occupies by far the largest proportion of the whole response time as the database grows towards infinity, since the linear search algorithm used in this project iterates over the whole database. In the extreme case, even if the first processed data entry has a 100% similarity to the input document, the program still iterates over the database and processes all the remaining data entries.

The first problem is quite easy to solve. Like most search engines, the program can return what it believes is the correct answer and let the user decide whether to accept the result or not; even though 1% is a very low similarity, it is the best answer found in the database. The second problem is more serious. The unnecessary time spent working with the database directly causes a longer response time, which is not acceptable and leads to poor search performance. Thus, finding an effective method to avoid the redundant database access is important and urgent in practice.

5.7.2 Analysis Statement
A further study of the measured document similarities under the four defined scenarios is performed to solve these problems. Instead of measuring and sorting all similarities, an appropriate statistical model is inferred to process and filter the measured similarities. The model aims to improve the search performance by both reducing the response time and making the search result high-quality and reliable. The model sets up three intervals for the document similarity, which is measured between the input document and a data entry in the database. Based on the pre-defined outcome range (from 0% to 100%) of the Cosine Similarity, which is described in


section 4.2.3.3, the three intervals are continuous and together make up the whole outcome range, as shown in Figure 5.41. Each interval implements a functionality that processes the data entry based on its measured similarity.

Cosine Similarity Outcome Range

Ignorable Interval | Pending Interval | Acceptable Interval

Figure 5.41 The outcome range of the Cosine Similarity consists of three intervals.

Ignorable Interval. This interval ranges from 0% to C1, which is the lower boundary of the document similarity. A data entry is directly ignored and not processed further if its measured similarity falls in this interval. The interval is intended to prevent unreliable results by filtering out the data entries whose similarity to the input document is too low.

Acceptable Interval. This interval is set up to avoid redundant database access. More precisely, the search process is immediately terminated and the corresponding data entry is returned as the result when the measured similarity falls in the range from C2 to 100%. In contrast to C1, C2 is the upper boundary of the document similarity.

Pending Interval. A data entry is given pending status if its measured similarity lies in the range from C1 to C2. The entry and its similarity are temporarily put into a list, which is sorted after the whole match process. This interval is applied to the similarities that are neither Ignorable nor Acceptable.

Determining the most suitable ranges for the three intervals is the subject studied in this section. In other words, the two boundaries C1 and C2 need to be inferred and verified. The inference is based on objective evidence, namely the document similarities measured under the four defined scenarios.
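For illustration, the proposed filtration mechanism can be sketched as below. C1 and C2 are the lower and upper boundaries inferred later in this section; the class, the method names and the vector representation of the fingerprints are assumptions made for this example, not the prototype's actual code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed interval-based filtration during the match process.
public final class FilteredSearch {

    private final double lowerBoundary;  // C1: below this, the entry is ignored
    private final double upperBoundary;  // C2: at or above this, return immediately

    public FilteredSearch(double c1, double c2) {
        this.lowerBoundary = c1;
        this.upperBoundary = c2;
    }

    /** Returns the name of the best matching entry, or null if nothing exceeds C1. */
    public String search(Map<String, double[]> database, double[] inputFingerprint) {
        Map<String, Double> pending = new HashMap<>();
        for (Map.Entry<String, double[]> entry : database.entrySet()) {
            double sim = cosineSimilarity(inputFingerprint, entry.getValue());
            if (sim >= upperBoundary) {
                return entry.getKey();             // Acceptable: stop searching
            }
            if (sim >= lowerBoundary) {
                pending.put(entry.getKey(), sim);  // Pending: decide after the scan
            }                                      // else Ignorable: skip entirely
        }
        return pending.isEmpty() ? null
                : Collections.max(pending.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    private static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```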

5.7.3 Overall Distribution Analysis
To infer and verify the upper boundary C2 and the lower boundary C1, an overall similarity analysis is performed. The document similarities between all the input files (the PDF documents in the folder Publish) and all the data entries (the document fingerprints in the feature database Author) are measured. Each input document has 100 corresponding document similarities under each scenario, so the total number of document similarities per scenario is 100 * 100 = 10,000.


[Scatter plot: Similarity distribution of all measured document similarities. Y-axis: similarity (0% to 100%); X-axis: input documents 0–100; series: F-Complete, F-Full, F-Half, F-Quarter.]

Figure 5.42 The distribution of all the measured document similarities.

According to Figure 5.42, which plots all the measured document similarities under the four scenarios, most data points lie below 60% but above 10%, while the points are relatively sparse in the other areas under every scenario. It follows that the designed fingerprint distinguishes the similarities in an effective way. The similarities over 90% are assumed to be useful and of high quality, while the similarities below 10% are relatively useless. Hence, the following analysis sets up two boundaries based on this overall distribution in order to infer the three intervals in a more effective and clear way.

Three terms are defined here and used in the following analysis to clarify the differences. HSIM is the highest of all document similarities measured between the input document and all data entries in the database; the data entry that has the HSIM to the input file is considered the correct answer and returned to the user, as defined in this project. CSIM is the document similarity measured between the input document and the data entry that has the same name; under ideal circumstances the CSIM is expected to be equal to the HSIM. SSIM refers to the second highest similarity measured between the input document and all data entries in the database.


5.7.4 Upper Boundary Inference
The upper boundary is intended to improve the search performance by avoiding unnecessary database access as far as possible. In contrast to C1, C2 is much easier to infer and verify. Under absolutely ideal conditions, C2 should be set to 100% in order to determine whether two documents are identical. But in this project, the sample PDF documents with the same name are only highly similar to each other, which means that the upper boundary needs to be adjusted downwards appropriately.

[Scatter plot: CSIM under the four scenarios. Y-axis: similarity (0% to 100%); X-axis: documents with the same name, 0–100; series: F-Complete, F-Full, F-Half, F-Quarter.]

Figure 5.43 The CSIM under the four defined scenarios.

According to Figure 5.43 above, there is no obvious difference or gap among the CSIM data points under the four scenarios, and most points lie in the range from 80% to 100%. It follows that the four scenarios can be discussed and analyzed as a whole. Since the CSIM is, in the best case, equal to the HSIM that is returned as the result, the upper boundary analysis focuses on setting up a suitable C2 that separates the CSIM data points with the highest similarities from all the available data points in Figure 5.42 as well as possible. Furthermore, the boundary is assumed to be located in the range where the points are most concentrated.


Although all similarity data points above the upper boundary C2 (hereafter called CAND points) are candidates that may be returned as the correct answer, there is no necessary connection between the CSIM points and the CAND points. Two cases have to be taken into consideration when evaluating the upper boundary: 1. the CAND points may include other data points than the CSIM points; 2. not all CSIM points are necessarily placed above the upper boundary. In other words, all the CSIM points become CAND points only under the absolute best conditions.

To externalize the quality of an applied upper boundary and to pick the most effective one, the effectiveness of the upper boundary is evaluated from two perspectives, which quantify the effectiveness of the applied C2.

Accuracy Tolerance Rate (ATR). This is the proportion of the number of CAND points (NCAND) that is made up of CSIM points (NCSIM). A higher ATR means that NCSIM and NCAND are closer to each other, so the boundary C2 classifies the data points with high quality and the model approaches the ideal condition. Conversely, a lower ATR means that the CAND points include more incorrect and redundant points, and that C2 does not classify well.

Result Selection Rate (RSR). This is the probability that a CSIM point is returned as the correct answer to the input document. It is measured by dividing NCSIM by the total number of possible returned data points (in Figure 5.42), which is a constant of 40,000 over the four scenarios. Thus, a high RSR means that the CSIM is very likely to be returned as the answer.

C2 (%)   NCSIM   NCAND   ATR       RSR
70       390     604     64.57%    0.98%
72.5     386     549     70.31%    0.97%
75       386     512     75.39%    0.97%
77.5     383     471     81.32%    0.96%
80       377     431     87.47%    0.94%
82.5     371     409     90.71%    0.93%
85       366     394     92.89%    0.92%
87.5     341     353     96.60%    0.85%
90       307     314     97.77%    0.77%
92.5     263     265     99.25%    0.66%
95       212     212     100.00%   0.53%
97.5     121     121     100.00%   0.30%
100      0       0       0.00%     0.00%

Figure 5.44 The NCSIM, NCAND, ATR and RSR for C2 in the range from 70% to 100%


Figure 5.44 above is a table that records the NCSIM and NCAND, and their corresponding ATR and RSR, for upper boundary values C2 in the range from 70% to 100%. It is clear from the table that both the NCSIM and the NCAND decrease as C2 increases, and that the NCAND decreases faster than the NCSIM. Once C2 passes a certain value, the NCSIM and NCAND become identical. Consequently, the ATR grows with C2 and reaches its peak of 100% when the NCSIM equals the NCAND. Conversely, the RSR decreases as C2 (and thereby the ATR) increases, because the RSR is directly proportional to the NCSIM as defined.

The conclusion drawn from Figure 5.44 is that inferring the upper boundary amounts to finding a balance between the return probability (RSR) and the return quality (ATR). More specifically, it becomes a conflict between the accuracy discussed in section 5.4 and the response time analyzed in section 5.5. A higher C2 helps to provide a more accurate result, but the response time is not necessarily reduced much, because fewer entries satisfy the stricter condition; a lower C2 reduces the response time noticeably, but the returned result probably has a lower quality. This project proposes a statistical model to find the best balance between accuracy and response time when setting the upper boundary C2. The final effectiveness of the applied boundary (EUB) is quantified as the product of the ATR and the RSR, as shown below.

EUB = ATR * RSR = (NCSIM / NCAND) * (NCSIM / 40,000) = NCSIM² / (40,000 * NCAND)
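As a check of the formula against the measured data: at C2 = 85%, Figure 5.44 gives NCSIM = 366 and NCAND = 394, so EUB = 366² / (40,000 * 394) ≈ 0.85%, which matches the value plotted for 85% in Figure 5.45.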

Effectiveness of the Upper Boundary (EUB) per applied C2:

C2 (%)   70      72.5    75      77.5    80      82.5    85      87.5    90      92.5    95      97.5    100
EUB      0.63%   0.68%   0.73%   0.78%   0.82%   0.84%   0.85%   0.82%   0.75%   0.65%   0.53%   0.30%   0.00%

Figure 5.45 Effectiveness of the applied upper boundary for C2 in the range from 70% to 100%


Figure 5.45 above is a bar chart showing the final statistical result for the EUB under the different applied upper boundaries C2. According to the figure, the trend of the EUB falls into two intervals. When C2 is at most 85%, the EUB increases slowly with the growth of C2 and reaches its highest point at 85%. After that, the EUB plummets towards 0 as C2 approaches 100%. Thus, the conclusion about the upper boundary is that it works most effectively, in the designed evaluation system, when C2 is set to 85%; at this value the accuracy and the response time are likely to be the most balanced.

5.7.5 Lower Boundary Inference
In comparison with the upper boundary, which filters the correct answers out of all possible answers, the lower boundary to be inferred filters out the answers that are definitely incorrect and useless. According to the previous section, the upper boundary C2 should be 85%, the most effective threshold for classifying the measured similarities. Technically, all data entries with similarities below 85% would be ignored under the most ideal conditions. But in reality there is no guarantee that all CSIM data points lie above 85%, as shown in Figure 5.43. This means that a buffer zone between the upper and lower boundaries is required to hold the CSIM points that cannot be returned in the Acceptable Interval; this area is the Pending Interval as defined. Hence, the first factor used to determine the lower boundary C1 is the set of CSIM data points that lie below the upper boundary C2. In other words, the lower boundary should at least be placed below the lowest such CSIM point (see Figure 5.46).

F-Complete F-Full F-Half F-Quarter Document CSIM Document CSIM Document CSIM Document CSIM 16 81.03% 5 83.92% 5 82.26% 5 77.01% 55 71.66% 55 70.56% 16 76.68% 7 72.28% 57 84.93% 57 84.53% 55 67.66% 16 78.84% 59 78.62% 59 81.09% 59 77.58% 22 79.82% 85 79.47% 95 81.19% 95 80.65% 42 80.91% 95 77.43% 51 84.29% 55 61.22% 59 70.98% 62 84.12% 95 77.74% Figure 5.46 The CSIM that are smaller 85% under the full scenarios.


The CSIM values that are smaller than 85% under each scenario are presented in Figure 5.46. Two special, negative exceptions (documents 58 and 77), described in section 5.3.2, are filtered out in order to make the inference more reliable and general. According to the figure, the lowest CSIM point is 61.22%, obtained for input document 55 under the F-Quarter. It follows that the first mandatory condition is C1 < 61.22%. Furthermore, according to the client requirements from Loredge, the program should return a result for every search query, even when the corresponding similarity is very low, and let the user decide whether to trust the result or not. Thus, the second factor used to infer C1 must ensure that the pending interval is wide enough to include at least one data point, so that the program can return a non-empty result in every case. In other words, the analysis of the second factor is performed on the situation where all the available CSIM/HSIM data points have been removed from Figure 5.42.

[Scatter plot: SSIM under the four scenarios. Y-axis: similarity (0% to 100%); X-axis: input documents 0–100; series: F-Complete, F-Full, F-Half, F-Quarter.]

Figure 5.47 The SSIM under the four defined scenarios


After removing the CSIM/HSIM points, the SSIM becomes the highest similarity to the input document under each scenario, as shown in Figure 5.47. To meet the client requirement, at the very least the SSIM should fall inside the pending interval. According to the figure, the lowest data point is 21.25%, belonging to document 100 under the F-Quarter. Thus, the second condition is C1 < 21.25%.

Combining the two inferred conditions, the lower boundary C1 should be set to no more than 21.25%, so that no CSIM data point is missed when there are CSIM values that are not returned in the Acceptable Interval, and so that a reliable result can still be returned when no CSIM data point is available. The upper boundary works most effectively for filtering out the correct answers, or more exactly for balancing the response time and the accuracy, when C2 is set to 85%.


Chapter 6 Discussion
This chapter presents a discussion of the performed project. Firstly, the applied methodology is discussed and evaluated. Then, the proposed problem statement is reviewed in order to evaluate the obtained results. After that, a summative evaluation is performed to determine whether the objectives of the project have been achieved.

6.1 Methodology and Consequence
The overall work of this project is to develop a reliable and effective PDF-document based search program to be used in the Loredge platform. The problem is stated in section 1.2, with the background introduced in section 1.1.2. This project seeks to present a feasible and complete prototype of the search program and to perform the relevant performance analysis and evaluation, which is intended to provide a meaningful foundation for further work in the field of digital document processing and searching.

6.1.1 Methodology of the Development
The whole project follows the traditional waterfall model, consisting of Problem Statement, Data Collection, Prototype Development and Prototype Analysis, while the prototype development applies an incremental development model that integrates a set of unit developments, each following Design, Implementation and Testing/Debugging. Both the qualitative research method, associated with the inductive approach, and the quantitative method, associated with the deductive approach, are applied in the project in order to better meet the client requirements.

The main reason for applying the waterfall model is that it makes the progress of the project easy to manage and control, since every phase has a specific deliverable and a review process [32], and the phases do not overlap. Furthermore, the project plan is developed only after the client requirements are clearly understood. The largest disadvantage is that the waterfall model has poor fault tolerance, since it cannot handle uncertain risks. Once the project is in the later phases, such as the performance analysis and evaluation, it is impossible to go back to the earlier phases, especially the client requirements and the prototype design, to make any modification.


The analysis and evaluation are entirely based on the results obtained from the previous phase, whether those results are correct or not.

The study is conducted in four phases. The first phase is the Problem Statement, which elicits the client requirements and performs the relevant analysis. The main approach is qualitative interviews, arranged several times with Loredge to clarify and deeply understand the client requirements: what they want the program to do and how they want it to behave.

The second phase is the Data Collection, involving the collection of the related experimental material and an analytical study of the PDF documents provided by Loredge. The sample PDF documents, the development environment and the statistical measurements are determined in this early phase of the project. The selection of the programming environment (both the language and the database) is based on analyzing the merits and shortcomings of several available alternatives. The core of this phase is the analytical study of the sample PDF documents, which are scientific and academic articles in English. This document study is a combination of qualitative and quantitative approaches: the qualitative approach provides a linguistic basis for filtering out irrelevant and redundant data to make the documents effectively comparable, while the quantitative approach explains statistically why these data are removed, by measuring the percentage of the whole document that they occupy. In view of the outcome, the combined approach helps to focus on the data that is most representative of the original document.

The third phase, the Prototype Development, applies an incremental development model consisting of Design, Implementation and Testing/Debugging. By dividing the client requirements into units/modules, each with a clearly defined functionality, the risk management and the program testing/debugging become easier and more flexible, because problems and risks are identified and handled early during each unit development. The most significant potential disadvantage of this approach is that the whole set of client requirements needs to be well defined in order to be broken down into pieces. The second shortcoming is the integration, a relatively complex process that takes a considerable amount of time to merge the modules and perform the related integration tests in this project.

The last phase, the Prototype Analysis, quantitatively presents and analyzes the performance of the designed document fingerprint and the developed prototype from several perspectives. The outcomes of this approach (the accuracy rate, the linear regression model predicting the response time, the memory comparison and the similarity boundary assumption) not only make up a practical and comprehensive performance analysis but also lay a solid foundation for further study. The main disadvantage is the time consumption: to make the results and conclusions more accurate, a considerably large amount of sample data must be processed. In this project, the response time analysis is by far the most time-consuming part; the response time measurement is

performed at least 20,000 times, each of which takes from a few minutes to a few tens of minutes. The second disadvantage is the shortage of available experimental data. This shortage makes it very difficult to confirm and verify the developed statistical models, because an obtained statistical model needs to be retested and refined several times with new data before an explicit conclusion can finally be drawn.

6.1.2 Methodology of the Evaluation
Both formative and summative evaluations are carried out in this project. The most important and necessary of the formative evaluations is the proactive evaluation, which is performed only once, after the Data Collection, together with Loredge in order to gain a better understanding of the client requirements and the collected material before designing the prototype. Since the prototype is not a large program and the whole requirement is broken up into pieces, all inputs, outputs and logical relationships are clearly defined. Consequently, the clarificative evaluation, which clarifies the intent of the program, becomes somewhat redundant; this kind of evaluation works more like pseudocode describing the logical structure of the program. The interactive evaluation is performed twice in this project. The first time, the code quality is improved during the unit integration by following a consistent style and removing redundant and duplicated code. The second time, the interactive evaluation is based on the result of the prototype analysis and aims to improve the performance of the developed prototype.

6.2 Development Environment Discussion
Java is the programming language selected for this project, used in the integrated development environment NetBeans. NetBeans is a well-structured IDE that provides a very comprehensive overview [33]. Its most important merit is that both Java and NetBeans are cross-platform, which is one of the non-functional requirements. Furthermore, NetBeans contains many powerful built-in plugins compared with other IDEs; during the implementation and debugging of the prototype, it provides hints on code quality optimization and exception handling.

Apache PDFBox is the Java library imported for working with the PDF documents. Its first advantage is that it is easy to install and use, which avoids spending too much time on learning new things. PDFBox is also reliable and stable: it identifies each thrown exception, provides a relevant solution, and did not crash even once during the whole program development.

The local document folders work as the database, and the text files within them are the data entries. This is the simplest and most effective arrangement for modification and maintenance. In addition, it is easy to back up the database by copying the folders, and the performance analysis of the database becomes easier.


6.3 Problem Statement Reiteration
In this section, the problem statement from section 1.2 is reiterated and combined with a review of the results obtained in this project.

6.3.1 Document Similarity Measurement Conclusion and Discussion

• How can a file that has the highest document similarity to an input PDF document be found in a very large document database?

The solution is the developed prototype. The key part of the prototype is measuring the document similarity between the input file and each data entry in the database with the Cosine Similarity, which measures the similarity between two non-zero vectors of the same dimension. All the data entries are then sorted by their measured similarity, and the one with the largest similarity is returned.

The Cosine Similarity is the only similarity measure applied in the project. Its main advantage is that it is easy to understand and implement, since the formula is already clearly defined. Its largest shortcoming is that it is a direction-based similarity measure: it only considers the angle between two vectors, while the magnitudes of the vectors are completely ignored. In an extreme case, the similarity between two documents can approach the maximum value even though there is a considerable gap between their sizes. According to the statistics of the document sizes, there is no necessary and consistent relation between the word counts of PDF documents with the same name in the folder Publish and the folder Author; it is therefore very difficult to take the document size into account in the similarity measurement. The second disadvantage of the Cosine Similarity is that it does not consider the content and meaning of the documents; in other words, the measurement is not conducted from a linguistic perspective. More precisely, the only factors taken into the measurement are the words and their occurrence counts, while the order and the grammatical properties of the words in the documents are irrelevant.
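For concreteness, a minimal sketch of the Cosine Similarity over two word-to-occurrence maps (the vector form implied by the fingerprints) is shown below; the class and method names are illustrative and not taken from the prototype.

```java
import java.util.Map;

// Cosine similarity between two documents represented as word -> occurrence maps.
public final class CosineSimilarity {

    public static double between(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        // Only words that occur in both documents contribute to the dot product.
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```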


6.3.2 Optimization Method Conclusion and Discussion

• In which way can the developed file search method be improved?

6.3.2.1 Document Pre-processing

Irrelevant and redundant data in the original documents are filtered out during the document reading process, both to reduce the amount of data to be compared and to improve the similarity quality, as shown in section 4.2.1.2. The selection of the data to be removed is a combination of subjective and objective research. The objective selection covers punctuation and non-English alphabets, both of which undoubtedly have little practical significance in an English article, while the removal of Arabic numerals and stop words is a subjective decision. Statistics of the document preprocessing show that it removes on average 57.52% of the total words in each PDF document in the folder Author, while the value is 53.87% in the folder Publish. It follows that this process substantially reduces the data used for the similarity measurement. However, the reduction can possibly be improved further. The most flexible variable is the stop-word list, since it is difficult to determine whether a word is of vital importance for the content of an article. The stop-word list used in the project is homemade and therefore incomplete and imperfect. In the future, a deeper and more comprehensive study that refines the list needs to be conducted. Overall, increasing the size of the stop-word list productively improves the document preprocessing.
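As an illustration of this step, the following is a minimal sketch of the preprocessing, assuming the text has already been extracted from the PDF; the stop-word list shown here is a tiny placeholder, not the list used in the project.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Keeps only English letters, lower-cases the text and drops stop words. */
public class Preprocessor {

    // Illustrative placeholder; the project used a larger, homemade stop-word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to", "in", "is"));

    public static List<String> preprocess(String rawText) {
        // Punctuation, Arabic numerals and non-English alphabets are replaced by spaces.
        String cleaned = rawText.toLowerCase().replaceAll("[^a-z]+", " ");
        List<String> words = new ArrayList<>();
        for (String w : cleaned.trim().split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }
}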

6.3.2.2 Document Fingerprint

The document fingerprint is proposed in section 4.2.1.2.3 and described in detail in section 5.2. It is a unique representation and identification of the original document, consisting of a set of high-frequency words together with their occurrence counts. Its main purpose is to speed up the comparison by performing a faster feature comparison. The largest advantage of the designed fingerprint is that it is simple to create and use. The selection of the document fingerprint is a trade-off that balances response time, accuracy rate and memory consumption. According to the result of the performance analysis in Chapter 5, as the size of the document fingerprint decreases, the response time of the search process and the memory consumption of the database storage are substantially reduced. Although the accuracy rate of the search result declines gradually, it remains at a relatively high level, above 99%. The conclusion drawn is that F-Quarter, which generates the smallest document fingerprint, provides the best performance of the search program.
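A minimal sketch of the fingerprint extraction is given below: the preprocessed words are counted and only the most frequent ones are kept. The parameter topN stands in for the fingerprint size computed by the linear functions of section 5.2.2 (for example the F-Quarter scenario); a fixed parameter is used here only for illustration.

import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds a fingerprint as the topN most frequent words with their occurrences. */
public class FingerprintExtractor {

    public static Map<String, Integer> extract(List<String> words, int topN) {
        // Count the occurrences of each preprocessed word.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // Keep only the topN most frequent words, highest occurrence first.
        Map<String, Integer> fingerprint = new LinkedHashMap<>();
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
              .limit(topN)
              .forEach(e -> fingerprint.put(e.getKey(), e.getValue()));
        return fingerprint;
    }
}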


Due to the limitation of the sample documents in terms of quantity and quality, the obtained results and conclusions are incomplete and limited. The reason is that the gaps among the accuracy rates under the four scenarios are not wide enough to support any feasible quantitative research, such as regression or classification analysis. The two main disadvantages of the designed document fingerprint are its simple composition and its subjective extraction method, which affect the response time and the applicability of the search program respectively. The simple composition is a double-edged sword: its merit is that the fingerprint is easy to create and use, while its shortcoming is ineffectiveness in the search process. The evidence is that the document fingerprint only contains data generated from the document itself; no extra label or search index is included. Consequently, a linear search must be conducted during the database access in order not to miss any data entry, and each fingerprint must be read completely during the similarity measurement. As a result, the response time suffers. In contrast with an objective content inference based on qualitative research, the key linear functions in the extraction method introduced in section 5.2.2 are largely subjective assumptions, which rely heavily on the number of pages and words in the PDF documents. It follows that the extraction method only works well when the size of the document reaches the values assumed in section 4.2.1.2.2. In other words, a document must be large enough to generate any fingerprint at all. This subjective assumption makes the program unreliable and uncertain under some conditions. Although this shortcoming does not appear in this project, because all the sample PDF documents satisfy the size requirement, there is no guarantee that all documents will meet the requirement in practical use of the search program.

6.3.2.3 Feature Database

This optimization is derived from the document fingerprint. By creating an extra feature database alongside the Loredge database, the database query process is permanently accelerated. The feature database is intended to store the document fingerprints that are extracted from the documents in the original PDF document database. The most important factor that affects the effectiveness of the feature database is its size. According to the memory consumption analysis in section 5.6, the feature database is far smaller than the original PDF database under any scenario. The preliminary conclusion is that the feature database is an effective optimization method that improves the search performance at low cost. However, since the cost of database creation, operation and maintenance is unknown, it is difficult to say that the feature database is definitely effective. Further study of the feature database should be an integrated research effort that covers all relevant areas.
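As a sketch of how such a feature database could be built on top of the existing folder-based storage, each fingerprint could be written once as a small text file of word-count pairs; the folder name and file format below are assumptions for illustration, not the layout used by Loredge.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Writes one fingerprint per document into a separate feature folder. */
public class FeatureDatabaseWriter {

    public static void store(String documentName, Map<String, Integer> fingerprint)
            throws IOException {
        Path featureDir = Paths.get("feature-db");      // hypothetical folder name
        Files.createDirectories(featureDir);
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : fingerprint.entrySet()) {
            lines.add(e.getKey() + " " + e.getValue()); // one "word count" pair per line
        }
        Files.write(featureDir.resolve(documentName + ".txt"), lines);
    }
}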


6.4 Summative Evaluation

As mentioned in Chapter 3, the summative evaluation is conducted at the end of the project to summarize the project by answering the following questions.

6.4.1 Outcome Evaluation

• Have the goals/objectives of the project been met?

The main purpose of the project stated in section 1.3 is generally fulfilled, since the development was conducted as planned and designed. Although the study is tentative and incomplete, Loredge is expected to benefit from it. The designed prototype, together with the proposed optimization methods, provides general but valuable insights and suggestions for developing a complete and effective PDF document based search engine for Loredge. Furthermore, the extensive experimental data that concretely presents the performance of the prototype lays a solid foundation for further analysis and development. When evaluating rationally and critically, however, it is difficult to claim that the client requirements are fully achieved. To elaborate, all the functional requirements are completely met, because both the output and the procedure of the program achieve the anticipated effect, while there is no effective method to examine the non-functional requirements. It follows that the performance of the result is restricted by the limitations and delimitations of the project; in particular, the response time analyzed in section 5.5 is still too long compared to existing search engines.

6.4.2 Impact Evaluation

• What is the overall impact of the project?

The major overall impact of this project on the IT industry is the framework and prototype of a document based search engine. The second impact is the designed feature database, which is placed on the server side and works as a table of contents for the original document database. Each search request queries the feature database after it arrives at the server side, and the query result then decides whether the original database needs to be queried further. The search performance optimization approach that applies a representative third-party database is not only feasible and effective in document based search, but also in all kinds of search that involve big data. Besides the two impacts mentioned above, the project also proposes a flexible and effective approach to preprocess a text-based document in order to obtain its more important and representative data. It is both an objective and a subjective method, which provides a new and general clue for document refinement and data mining.

6.4.3 Improvement Evaluation

• What resources will be needed in order to improve the program?

The largest weakness of the project is that it lacks sufficient resources (hardware and software) to retest and refine the prototype, due to both the limitations and the delimitations. Combining these limitations with the performance analysis in Chapter 5, there are several feasible methods and suggestions that could possibly optimize the program and lay a solid foundation for further development.

PDF Document Library

Apache PDFBox is the only Java library imported to work with PDF documents in this project. Because of the time limitation of the project, no other Java library was applied and compared to PDFBox. PDFBox affects the search performance mainly from two perspectives: the document reading time and the quality of the extracted content. In this project, the average time spent on reading in a PDF document is 0.24 seconds. Although the proportion of the reading time decreases as the database size grows, 0.24 seconds is still too long compared to a real-world search engine. Regarding content extraction, PDFBox does not successfully read and handle all the sample PDF documents; the failure rate is 1%.
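For reference, the following is a minimal sketch of how the reading step can be done with Apache PDFBox 2.x, loading a PDF file and extracting its plain text; it illustrates the library usage rather than reproducing the prototype's exact code.

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/** Loads a PDF document and extracts its plain text with PDFBox. */
public class PdfReader {

    public static String readText(File pdfFile) throws IOException {
        // try-with-resources closes the document even if the extraction fails.
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }
}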

Database Application

The implemented database is definitely one of the most important factors that affect the search performance. The local text-file reading approach is simple but ineffective compared to a mature database management system. This impact is mainly reflected in the response time and the memory consumption.


According to the response time analysis in section 5.5, the database access time becomes by far the largest part of the whole response time as the database grows. On the other hand, the proposed feature database requires extra memory on the server-side database; the relevant analysis is presented in section 5.6. Applying a mature database management system, such as MySQL or Oracle, is possibly an effective approach to reduce both the data access time and the memory consumption of the feature database.

Database Access Approach

A complete linear search is the only search approach implemented in this project. The subjective reason is to not miss any available data entry in the database, so that all document similarities can be obtained for further study; the objective factor is that both the feature and the original document databases are constructed sequentially. The main disadvantage of the linear search is clearly its time complexity, since all elements must be visited. To reduce and optimize the database access, categorizing the data entries by labeling them with representative tags is a feasible and effective solution. A tag could be the type of the document, the size of the document or other metadata that represents the document; moreover, multi-tag labeling identifies and categorizes an entry with higher quality. A second approach is to apply a more effective search algorithm, such as binary search. However, this kind of search algorithm requires that all data entries in the database are pre-sorted in a certain order, so the cost of the pre-sorting must be taken into account in the total cost. Another solution is to set upper and lower boundaries for the measured document similarity, as analyzed in section 5.7. This approach improves the performance by returning an entry that has a sufficiently high similarity and ignoring entries that have a sufficiently low similarity.
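The sketch below illustrates such a similarity-based filtration during the linear scan, reusing the CosineSimilarity sketch from section 6.3.1; the boundary values are illustrative assumptions, not the values measured in section 5.7.

import java.util.List;
import java.util.Map;

/** Linear scan with an early return above the upper boundary and
 *  rejection of candidates below the lower boundary. */
public class FilteredSearch {

    private static final double UPPER_BOUND = 0.95; // assumed "high enough" similarity
    private static final double LOWER_BOUND = 0.10; // assumed "low enough" similarity

    public static int findBestMatch(Map<String, Integer> query,
                                    List<Map<String, Integer>> database) {
        int bestIndex = -1;
        double bestSimilarity = LOWER_BOUND;
        for (int i = 0; i < database.size(); i++) {
            double s = CosineSimilarity.similarity(query, database.get(i));
            if (s >= UPPER_BOUND) {
                return i;               // good enough: stop scanning immediately
            }
            if (s > bestSimilarity) {   // entries below the lower boundary are ignored
                bestSimilarity = s;
                bestIndex = i;
            }
        }
        return bestIndex;               // -1 if no entry exceeds the lower boundary
    }
}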

6.4.4 Sustainability Evaluation

• How sustainable is the project, and does it need to be continued in its entirety?

Research on document based search engines is absolutely sustainable. In contrast with the image-based search engine provided by Google several years ago, the document based one is still in its infancy, but it has bright market prospects.


This project is supposed to be carried out with minimal resources. By following a result-oriented development approach, the program aims at a high-quality search engine that focuses entirely on returning a correct result to the user. The successfully developed prototype indicates the minimal resource consumption required to implement such a search engine. The only uncertain variable that affects the sustainability of the project development is Apache PDFBox, which is an open source library; the gap in effectiveness between PDFBox and other feasible Java libraries needs to be studied and clarified for resource optimization. The search algorithm applied in the database access is also a key point that can improve the resource consumption in terms of both time and energy. Moreover, a program with high code quality implemented by Loredge's professional developers would further reduce the energy consumption. Looking at the shortcomings of the project, the part that consumed the most resources is clearly the performance analysis. During this phase, the developed program had to be tested and debugged thousands of times with various parameters and input arguments in order to obtain as much data as possible for accurate statistical analysis and for finding relevant optimal solutions. This is an unavoidable problem that all software developers face, since every newly developed program needs to be retested and refined in order to achieve the best performance. A general solution that alleviates the problem is to make the development plan as detailed and well-structured as possible.


References

[1]. Emelie Södergren, "I'm here to realize Loredge”, retrieved 2017-02-03 Availability at: https://www.kth.se/en/innovation/nyheter/jag-ar-har-for-att-forverkliga-loredge-1.663504

[2]. Adobe Systems Incorporated, PDF Reference, sixth edition, version 1.7, retrieved 2017-02-03, Availability at: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.

[3]. Adobe Systems Incorporated (2008), Public Patent License, ISO 32000-1: 2008 – PDF 1.7 (PDF), retrieved 2017-02-03, Availability at: https://www.adobe.com/pdf/pdfs/ISO32000- 1PublicPatentLicense.pdf

[4]. Multimedia and PDFs (Acrobat Pro DC), retrieved 2017-02-03 Availability at: https://helpx.adobe.com/acrobat/using/adding-multimedia-pdfs.html

[5]. HOW IS A PDF CREATED?, retrieved 2017-02-03, Availability at: http://www.investintech.com/resources/articles/createapdf/

[6]. Optical character recognition, retrieved 2017-02-27, Availability at: https://en.wikipedia.org/wiki/Optical_character_recognition

[7]. Oxford Dictionaries, second definition of Article, retrieved 2017-02-27, Availability at: https://en.oxforddictionaries.com/definition/article

[8]. DD2431 Machine Learning, "Support Vector Machines", retrieved 2017-01-12, Availability at: https://www.kth.se/social/files/57dbef14f2765420e74aaf42/08-linsep-handout.pdf

[9]. Jason Weston, "Support Vector Machine Tutorial", retrieved 2017-01-12, Availability at: http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf

[10]. Search algorithm, retrieved 2017-01-12, Availability at: https://en.wikipedia.org/wiki/Search_algorithm

[11]. Logarithmic time, Time complexity, retrieved 2017-01-12, Availability at: https://en.wikipedia.org/wiki/Time_complexity#logarithmic_time

[12]. Master's programme in Machine Learning, retrieved 2017-01-15, Availability at: https://www.kth.se/en/studies/master/machinelearning/description-1.48533

[13]. DD2431 Machine Learning, Linear Separation, retrieved 2017-01-15, Availability at: https://www.kth.se/social/files/57dbef14f2765420e74aaf42/08-linsep-handout.pdf

[14]. Linear regression, Applications of linear regression, retrieved 2017-02-22, Availability at: https://en.wikipedia.org/wiki/Linear_regression#Applications_of_linear_regression

[15]. Linear trend estimation, retrieved 2017-02-22, Availability at: https://en.wikipedia.org/wiki/Linear_trend_estimation#Goodness_of_fit_.28R-squared.29_and_trend


[16]. PDFBox, retrieved 2017-01-17, Availability at: https://pdfbox.apache.org/

[17]. Apache PDFBox and FontBox 1.0.0 released, retrieved 2017-01-17, Availability at: http://www.h-online.com/open/news/item/Apache-PDFBox-and-FontBox-1-0-0-released-932436.html

[18]. PDFBox Project Incubation Status, retrieved 2017-01-17, Availability at: https://incubator.apache.org/projects/pdfbox.html

[19]. PDFBox API Docs, retrieved 2017-01-17, Availability at: https://pdfbox.apache.org/docs/2.0.3/javadocs/

[20]. Given, Lisa M. (2008). The Sage encyclopedia of qualitative research methods. Los Angeles, Calif.: Sage Publications. ISBN 1-4129-4163-6, retrieved 2017-03-22,

[21]. Susan E. Wyse, "What is the Difference between Qualitative Research and Quantitative Research?", retrieved 2017-03-22, Availability at: http://www.snapsurveys.com/blog/what-is-the-difference-between-qualitative-research-and-quantitative-research/

[22]. Qualitative research, retrieved 2017-02-01, Availability at: https://en.wikipedia.org/wiki/Qualitative_research

[23]. John Dudovskiy, "Inductive Approach (Inductive Reasoning)", retrieved 2017-02-01, Availability at: http://research-methodology.net/research-methodology/research-approach/inductive-approach-2/

[24]. Margaret Rouse, "Definition prototype", retrieved 2017-04-01, Availability at: http://searchmanufacturingerp.techtarget.com/definition/prototype

[25]. William M.K. Trochim, Introduction to Evaluation, retrieved 2017-03-05, Availability at: http://www.socialresearchmethods.net/kb/intreval.php

[26]. Evaluation Toolbox, Formative evaluation, retrieved 2017-03-06, Availability at: http://evaluationtoolbox.net.au/index.php?option=com_content&view=article&id=24&Itemid= 125

[27]. Owen, 2007, Proactive Evaluation, retrieved 2017-03-06, Availability at: http://4p95salima.weebly.com/

[28]. Evaluation Terminology 2014, Section 3 Workbooks and Support, retrieved 2017-03-06, Availability at: http://www.fin.gov.nt.ca/sites/default/files/documents/section_3_workbooks_and_supports.pdf

[29]. Java SE8 documentation, Interface Map, retrieved 2017-03-01, Availability at: https://docs.oracle.com/javase/8/docs/api/java/util/Map.html

[30]. Saul McLeod, "Sampling Methods", retrieved 2017-04-18, Availability at: http://www.simplypsychology.org/sampling.html


[31]. Sandra Slutz, Kenneth L. Hess, "Increasing the Ability of an Experiment to Measure an Effect", retrieved 2017-05-01, Availability at: http://www.sciencebuddies.org/science-fair-projects/top_research-project_signal-to-noise-ratio.shtml

[32]. What is Waterfall model - advantages, disadvantages and when to use it?, retrieved 2017-02-06, Availability at: http://istqbexamcertification.com/what-is-waterfall-model-advantages-disadvantages-and-when-to-use-it/

[33]. Geertjan-Oracle, Top 5 Benefits of "NetBeans Platform for Beginners", Mar 14, 2014, retrieved 2017-02-06, Availability at: https://blogs.oracle.com/geertjan/entry/top_5_benefits_of_netbeans



TRITA-ICT-EX-2017:49
